BigDL-LLM

bigdl-llm is a library for running LLMs (large language models) on your Intel laptop using INT4 with very low latency¹ (for any Hugging Face Transformers model).

(It is built on top of the excellent work of llama.cpp, gptq, ggml, llama-cpp-python, gptq_for_llama, bitsandbytes, chatglm.cpp, redpajama.cpp, gptneox.cpp, bloomz.cpp, etc.)

Demos

See the optimized performance of chatglm2-6b, llama-2-13b-chat, and starcoder-15b models on a 12th Gen Intel Core CPU below.

Verified models

You may use any Hugging Face Transformers model with bigdl-llm; the following models have been verified on Intel laptops.

Model Example
LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) link1, link2
LLaMA 2 link
MPT link
Falcon link
ChatGLM link
ChatGLM2 link
Qwen link
MOSS link
Baichuan link
Dolly-v1 link
Dolly-v2 link
RedPajama link1, link2
Phoenix link1, link2
StarCoder link1, link2
InternLM link
Whisper link

Working with bigdl-llm

Table of Contents

  • Install
  • Download Model
  • Run Model
      • Hugging Face transformers API
      • LangChain API
      • CLI Tool
  • bigdl-llm API Doc
  • bigdl-llm Dependencies

Install

You may install bigdl-llm as follows:

pip install --pre --upgrade bigdl-llm[all]

Download Model

You may download any PyTorch model in Hugging Face Transformers format (FP16, FP32, or GPTQ-4bit).
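
For example, one way to fetch a model from the Hugging Face Hub to a local folder is with the huggingface_hub package (a minimal sketch; the repo id and target directory below are placeholders):

# download a Hugging Face Transformers model to a local directory
# (the repo id below is only an example; any Transformers model repo works)
from huggingface_hub import snapshot_download

model_path = snapshot_download(repo_id="meta-llama/Llama-2-7b-chat-hf",
                               local_dir="/path/to/model/")
print(model_path)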

Run Model

You may run the models using bigdl-llm through one of the following APIs:

  1. Hugging Face transformers API
  2. LangChain API
  3. CLI (command line interface) Tool

Hugging Face transformers API

You may run the models using the transformers-style API in bigdl-llm.

  • Using Hugging Face transformers INT4 format

    You may apply INT4 optimizations to any Hugging Face Transformers models as follows.

    #load Hugging Face Transformers model with INT4 optimizations
    from bigdl.llm.transformers import AutoModelForCausalLM
    model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
    

    After loading the Hugging Face Transformers model, you may easily run the optimized model as follows.

    #run the optimized model
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    input_ids = tokenizer.encode(input_str, ...)
    output_ids = model.generate(input_ids, ...)
    output = tokenizer.batch_decode(output_ids)
    

    See the complete examples here.
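
    For reference, a minimal end-to-end sketch combining the two snippets above might look like the following (the model path and prompt are placeholders, and generation arguments such as max_new_tokens are only examples):

    # load an INT4-optimized model and run a short generation
    from bigdl.llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    model_path = '/path/to/model/'   # placeholder path
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    input_ids = tokenizer.encode('Once upon a time,', return_tensors="pt")
    output_ids = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.batch_decode(output_ids)[0])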

    Note: You may apply more low-bit optimizations (including INT8, INT5, and INT4) as follows:

    model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
    

    See the complete example here.

    After the model is optimized using INT4 (or INT8/INT5), you may save and load the optimized model as follows:

    model.save_low_bit(model_path)
    
    new_model = AutoModelForCausalLM.load_low_bit(model_path)
    

    See the example here.

  • Using native INT4 format

    You may also convert Hugging Face Transformers models into native INT4 format for maximum performance as follows.

    Note: Currently only the llama/bloom/gptneox/starcoder/chatglm model families are supported; you may use the corresponding API to load the converted model. (For other models, you may use the Transformers INT4 format as described above.)

    #convert the model
    from bigdl.llm import llm_convert
    bigdl_llm_path = llm_convert(model='/path/to/model/',
           outfile='/path/to/output/', outtype='int4', model_family="llama")
    
    #load the converted model
    #switch to ChatGLMForCausalLM/GptneoxForCausalLM/BloomForCausalLM/StarcoderForCausalLM to load other models
    from bigdl.llm.transformers import LlamaForCausalLM
    llm = LlamaForCausalLM.from_pretrained("/path/to/output/model.bin", native=True, ...)
    
    #run the converted model
    input_ids = llm.tokenize(prompt)
    output_ids = llm.generate(input_ids, ...)
    output = llm.batch_decode(output_ids)
    

    See the complete example here.

LangChain API

You may run the models using the LangChain API in bigdl-llm.

  • Using Hugging Face transformers INT4 format

    You may run any Hugging Face Transformers model (with INT4 optimizations applied) using the LangChain API as follows:

    from bigdl.llm.langchain.llms import TransformersLLM
    from bigdl.llm.langchain.embeddings import TransformersEmbeddings
    from langchain.chains.question_answering import load_qa_chain
    
    embeddings = TransformersEmbeddings.from_model_id(model_id=model_path)
    bigdl_llm = TransformersLLM.from_model_id(model_id=model_path, ...)
    
    doc_chain = load_qa_chain(bigdl_llm, ...)
    output = doc_chain.run(...)
    

    See the examples here.
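
    As a quick usage sketch, the TransformersLLM object can also be dropped into a plain LLMChain (the model path, prompt template, and question below are placeholders):

    from bigdl.llm.langchain.llms import TransformersLLM
    from langchain import LLMChain, PromptTemplate

    # a minimal prompt template; adjust it to match your model's expected instruction format
    prompt = PromptTemplate(template="Q: {question}\nA:", input_variables=["question"])

    llm = TransformersLLM.from_model_id(model_id='/path/to/model/')
    chain = LLMChain(llm=llm, prompt=prompt)
    print(chain.run(question="What is BigDL-LLM?"))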

  • Using native INT4 format

    You may also convert Hugging Face Transformers models into native INT4 format, and then run the converted models using the LangChain API as follows.

    Note: Currently only the llama/bloom/gptneox/starcoder model families are supported; for other models, you may use the Transformers INT4 format as described above.

    from bigdl.llm.langchain.llms import BigdlNativeLLM
    from bigdl.llm.langchain.embeddings import BigdlNativeEmbeddings
    from langchain.chains.question_answering import load_qa_chain
    
    embeddings = BigdlNativeEmbeddings(model_path='/path/to/converted/model.bin',
                              model_family="llama",...)
    bigdl_llm = BigdlNativeLLM(model_path='/path/to/converted/model.bin',
                         model_family="llama",...)
    
    doc_chain = load_qa_chain(bigdl_llm, ...)
    doc_chain.run(...)
    

    See the examples here.

CLI Tool

Note: Currently the bigdl-llm CLI supports the LLaMA (e.g., vicuna), GPT-NeoX (e.g., redpajama), BLOOM (e.g., phoenix) and GPT2 (e.g., starcoder) model architectures; for other models, you may use the transformers-style or LangChain APIs.

  • Convert model

    You may convert the downloaded model into native INT4 format using llm-convert.

    #convert PyTorch (fp16 or fp32) model; 
    #llama/bloom/gptneox/starcoder model family is currently supported
    llm-convert "/path/to/model/" --model-format pth --model-family "bloom" --outfile "/path/to/output/"
    
    #convert GPTQ-4bit model
    #only llama model family is currently supported
    llm-convert "/path/to/model/" --model-format gptq --model-family "llama" --outfile "/path/to/output/"
    
  • Run model

    You may run the converted model using llm-cli or llm-chat (built on top of main.cpp in llama.cpp).

    #help
    #llama/bloom/gptneox/starcoder model family is currently supported
    llm-cli -x gptneox -h
    
    #text completion
    #llama/bloom/gptneox/starcoder model family is currently supported
    llm-cli -t 16 -x gptneox -m "/path/to/output/model.bin" -p 'Once upon a time,'
    
    #chat mode
    #llama/gptneox model family is currently supported
    llm-chat -m "/path/to/output/model.bin" -x llama
    

bigdl-llm API Doc

See the initial bigdl-llm API Doc here.

bigdl-llm Dependencies

The native code/lib in bigdl-llm has been built using the following tools. Note that a lower GLIBC version on your Linux system may be incompatible with bigdl-llm; a quick way to check your system's version is shown after the table.

Model family   Platform   Compiler             GLIBC
llama          Linux      GCC 11.2.1           2.17
llama          Windows    MSVC 19.36.32532.0
llama          Windows    GCC 13.1.0
gptneox        Linux      GCC 11.2.1           2.17
gptneox        Windows    MSVC 19.36.32532.0
gptneox        Windows    GCC 13.1.0
bloom          Linux      GCC 11.2.1           2.29
bloom          Windows    MSVC 19.36.32532.0
bloom          Windows    GCC 13.1.0
starcoder      Linux      GCC 11.2.1           2.29
starcoder      Windows    MSVC 19.36.32532.0
starcoder      Windows    GCC 13.1.0
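
If you are unsure which GLIBC version your Linux system provides, one quick check from Python is shown below (a minimal sketch; running ldd --version from the shell reports the same information):

# print the C library name and version seen by the Python interpreter
# (on most Linux systems this reports glibc, e.g. ('glibc', '2.31'))
import platform

print(platform.libc_ver())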

  1. Performance varies by use, configuration and other factors. bigdl-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.