# BigDL-LLM
`bigdl-llm` is a library for running **LLM** (large language model) on your Intel **laptop** using INT4 with very low **latency** (for any Hugging Face *Transformers* model).
(It is built on top of the excellent work of llama.cpp, gptq, ggml, llama-cpp-python, gptq_for_llama, bitsandbytes, redpajama.cpp, gptneox.cpp, bloomz.cpp, etc.)
## Demos
See the optimized performance of phoenix-inst-chat-7b, vicuna-13b-v1.1, and starcoder-15b models on a 12th Gen Intel Core CPU below.
## Working with bigdl-llm
### Install

You may install `bigdl-llm` as follows:

```bash
pip install --pre --upgrade bigdl-llm[all]
```
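After installation, a quick import check confirms the package is usable; a minimal sketch (it only relies on the `bigdl.llm` APIs used later in this README):

```python
# sanity-check the installation by importing the APIs shown in this README
from bigdl.llm.transformers import AutoModelForCausalLM  # transformers-style API
from bigdl.llm import llm_convert                        # native INT4 conversion

print("bigdl-llm imported successfully")
```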
### Download Model

You may download any PyTorch model in Hugging Face *Transformers* format (FP16, FP32, or GPTQ-4bit).
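For example, one common way to fetch a checkpoint to a local directory is the `huggingface_hub` client; a minimal sketch (the repo id below is a placeholder for whichever model you want to run):

```python
# download a Hugging Face Transformers checkpoint to a local directory;
# "your-org/your-model" is a placeholder repo id
from huggingface_hub import snapshot_download

model_path = snapshot_download(repo_id="your-org/your-model")
print(model_path)  # pass this local path to the bigdl-llm APIs below
```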
### Run Model

You may run the models using `bigdl-llm` through one of the following APIs:
- CLI (command line interface) Tool
- Hugging Face `transformers`-style API
- LangChain API
- llama-cpp-python-style API
#### CLI Tool

> **Note**: Currently `bigdl-llm` CLI supports *LLaMA* (e.g., vicuna), *GPT-NeoX* (e.g., redpajama), *BLOOM* (e.g., phoenix) and *GPT2* (e.g., starcoder) model architectures; for other models, you may use the `transformers`-style or LangChain APIs.
- Convert model

  You may convert the downloaded model into native INT4 format using `llm-convert`.

  ```bash
  # convert PyTorch (fp16 or fp32) model;
  # llama/bloom/gptneox/starcoder model family is currently supported
  llm-convert "/path/to/model/" --model-format pth --model-family "bloom" --outfile "/path/to/output/"

  # convert GPTQ-4bit model
  # only llama model family is currently supported
  llm-convert "/path/to/model/" --model-format gptq --model-family "llama" --outfile "/path/to/output/"
  ```
- Run model

  You may run the converted model using `llm-cli` or `llm-chat` (built on top of `main.cpp` in llama.cpp).

  ```bash
  # help
  # llama/bloom/gptneox/starcoder model family is currently supported
  llm-cli -x gptneox -h

  # text completion
  # llama/bloom/gptneox/starcoder model family is currently supported
  llm-cli -t 16 -x gptneox -m "/path/to/output/model.bin" -p 'Once upon a time,'

  # chat mode
  # llama/gptneox model family is currently supported
  llm-chat -m "/path/to/output/model.bin" -x llama
  ```
#### Hugging Face transformers-style API

You may run the models using the `transformers`-style API in `bigdl-llm`.
- Using Hugging Face `transformers` INT4 format

  You may apply INT4 optimizations to any Hugging Face *Transformers* model as follows (a complete end-to-end sketch also appears after this list).

  ```python
  # load Hugging Face Transformers model with INT4 optimizations
  from bigdl.llm.transformers import AutoModelForCausalLM
  model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)

  # run the optimized model
  from transformers import AutoTokenizer
  tokenizer = AutoTokenizer.from_pretrained(model_path)
  input_ids = tokenizer.encode(input_str, ...)
  output_ids = model.generate(input_ids, ...)
  output = tokenizer.batch_decode(output_ids)
  ```

  See the complete example here.

  **Note**: For other quantized precisions, you may use the `load_in_low_bit` parameter instead; available types are `sym_int4`, `asym_int4`, `sym_int5`, `asym_int5` and `sym_int8`.

  ```python
  model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
  ```
- Using native INT4 format

  You may also convert Hugging Face *Transformers* models into native INT4 format for maximum performance as follows.

  > **Note**: Currently only the llama/bloom/gptneox/starcoder model families are supported; for other models, you may use the `transformers` INT4 format as described above.

  ```python
  # convert the model
  from bigdl.llm import llm_convert
  bigdl_llm_path = llm_convert(model='/path/to/model/',
                               outfile='/path/to/output/', outtype='int4', model_family="llama")

  # load the converted model
  from bigdl.llm.transformers import BigdlNativeForCausalLM
  llm = BigdlNativeForCausalLM.from_pretrained("/path/to/output/model.bin", ...)

  # run the converted model
  input_ids = llm.tokenize(prompt)
  output_ids = llm.generate(input_ids, ...)
  output = llm.batch_decode(output_ids)
  ```

  See the complete example here.
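Putting the pieces above together, here is a minimal end-to-end sketch of the `transformers`-style INT4 path; the model path and prompt are placeholders, and the `generate`/`decode` keyword arguments are standard Hugging Face `transformers` parameters filled in for the `...` in the snippets above:

```python
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = '/path/to/model/'  # placeholder local checkpoint

# load with INT4 optimizations, as shown above
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# encode a prompt, generate, and decode (standard transformers usage)
input_ids = tokenizer.encode('Once upon a time,', return_tensors='pt')
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```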
#### LangChain API

You may run the models using the LangChain API in `bigdl-llm`.
- Using Hugging Face `transformers` INT4 format

  You may run any Hugging Face *Transformers* model (with INT4 optimizations applied) using the LangChain API as follows (a chain-wiring sketch also appears after this list):

  ```python
  from bigdl.llm.langchain.llms import TransformersLLM
  from bigdl.llm.langchain.embeddings import TransformersEmbeddings
  from langchain.chains.question_answering import load_qa_chain

  embeddings = TransformersEmbeddings.from_model_id(model_id=model_path)
  bigdl_llm = TransformersLLM.from_model_id(model_id=model_path, ...)

  doc_chain = load_qa_chain(bigdl_llm, ...)
  output = doc_chain.run(...)
  ```

  See the examples here.
- Using native INT4 format

  You may also convert Hugging Face *Transformers* models into native INT4 format, and then run the converted models using the LangChain API as follows.

  > **Note**: Currently only the llama/bloom/gptneox/starcoder model families are supported; for other models, you may use the `transformers` INT4 format as described above.

  ```python
  from bigdl.llm.langchain.llms import BigdlNativeLLM
  from bigdl.llm.langchain.embeddings import BigdlNativeEmbeddings
  from langchain.chains.question_answering import load_qa_chain

  embeddings = BigdlNativeEmbeddings(model_path='/path/to/converted/model.bin',
                                     model_family="llama", ...)
  bigdl_llm = BigdlNativeLLM(model_path='/path/to/converted/model.bin',
                             model_family="llama", ...)

  doc_chain = load_qa_chain(bigdl_llm, ...)
  doc_chain.run(...)
  ```

  See the examples here.
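As a concrete illustration of wiring the LLM above into a chain, here is a minimal sketch using LangChain's classic `LLMChain` and `PromptTemplate` (the model path is a placeholder; `LLMChain` and `PromptTemplate` are standard LangChain classes, not part of `bigdl-llm`):

```python
from langchain import LLMChain, PromptTemplate
from bigdl.llm.langchain.llms import TransformersLLM

# a simple question-answering prompt template
template = "Q: {question}\nA:"
prompt = PromptTemplate(template=template, input_variables=["question"])

# load any Transformers model with INT4 optimizations, as shown above
llm = TransformersLLM.from_model_id(model_id='/path/to/model/')

chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(question="What is AI?"))
```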
#### llama-cpp-python-style API

You may also run the converted models using the llama-cpp-python-style API in `bigdl-llm` as follows.

```python
from bigdl.llm.models import Llama, Bloom, Gptneox

llm = Bloom("/path/to/converted/model.bin", n_threads=4)
result = llm("what is ai")
```
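Since this API mirrors llama-cpp-python, common generation keywords such as `max_tokens` and `stop` should apply as well; the sketch below assumes those llama-cpp-python parameters carry over to `bigdl-llm`'s port:

```python
from bigdl.llm.models import Llama

# llama-cpp-python-style call; max_tokens and stop follow llama-cpp-python
# conventions and are assumed (not confirmed) to carry over to bigdl-llm
llm = Llama("/path/to/converted/model.bin", n_threads=4)
result = llm("Q: What is AI?\nA:", max_tokens=64, stop=["Q:"])
print(result)
```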
## bigdl-llm Dependence

The native code/libraries in `bigdl-llm` have been built using the following tools; in particular, a lower GLIBC version on your Linux system may be incompatible with `bigdl-llm`.
| Model family | Platform | Compiler | GLIBC |
|---|---|---|---|
| llama | Linux | GCC 11.2.1 | 2.17 |
| llama | Windows | MSVC 19.36.32532.0 | |
| llama | Windows | GCC 13.1.0 | |
| gptneox | Linux | GCC 11.2.1 | 2.17 |
| gptneox | Windows | MSVC 19.36.32532.0 | |
| gptneox | Windows | GCC 13.1.0 | |
| bloom | Linux | GCC 11.2.1 | 2.29 |
| bloom | Windows | MSVC 19.36.32532.0 | |
| bloom | Windows | GCC 13.1.0 | |
| starcoder | Linux | GCC 11.2.1 | 2.29 |
| starcoder | Windows | MSVC 19.36.32532.0 | |
| starcoder | Windows | GCC 13.1.0 | |
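To compare your Linux system against the GLIBC column above, Python's standard library provides a quick check (`platform.libc_ver()` reports the C library the interpreter is linked against):

```python
import platform

# prints e.g. "glibc 2.31"; compare the version with the GLIBC column above
lib, version = platform.libc_ver()
print(lib, version)
```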