
BigDL-LLM

bigdl-llm is a library for running LLMs (large language models) on your Intel laptop using INT4 with very low latency (for any Hugging Face Transformers model).

(It is built on top of the excellent work of llama.cpp, gptq, ggml, llama-cpp-python, gptq_for_llama, bitsandbytes, redpajama.cpp, gptneox.cpp, bloomz.cpp, etc.)
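As a rough illustration of why INT4 helps on a laptop, the sketch below estimates how much memory a model's weights alone occupy at each precision. It assumes exactly 4 bits per INT4 weight. It ignores the per-block scale factors that real INT4 formats store, and the KV cache and activations needed at inference time. The results are therefore lower bounds, not bigdl-llm measurements. The parameter counts are nominal sizes taken from the demo model names; the helper name `weight_gib` is hypothetical.

```python
# Rough weight-memory estimate for a model at different precisions.
# Only the weights are counted; real INT4 formats also store per-block
# scale factors, and inference needs extra memory for the KV cache and
# activations, so treat these numbers as lower bounds.

BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int4": 4}


def weight_gib(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in GiB."""
    bits = BITS_PER_WEIGHT[dtype]
    return num_params * bits / 8 / 2**30


if __name__ == "__main__":
    # Nominal parameter counts matching the demo model names.
    for name, params in [("7B", 7e9), ("13B", 13e9), ("15B", 15e9)]:
        sizes = ", ".join(
            f"{dtype}: {weight_gib(params, dtype):.1f} GiB" for dtype in BITS_PER_WEIGHT
        )
        print(f"{name} -> {sizes}")
```

INT4 weights take a quarter of the space of FP16 weights. This is why 7B to 15B models become practical within typical laptop RAM.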

Demos

See the optimized performance of phoenix-inst-chat-7b, vicuna-13b-v1.1, and starcoder-15b models on a 12th Gen Intel Core CPU below.

Working with bigdl-llm


Install

You may install bigdl-llm as follows:

pip install --pre --upgrade bigdl-llm[all]
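To confirm the installation from Python, a small check like the following reports the installed bigdl-llm version. It uses only the standard library's `importlib.metadata` (Python 3.8+). The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="bigdl-llm"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    v = installed_version()
    print(f"bigdl-llm {v}" if v else "bigdl-llm is not installed")
```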

Download Model

You may download any PyTorch model in Hugging Face Transformers format (in FP16, FP32, or GPTQ-4bit precision).

Run Model

You may run the models using bigdl-llm through one of the following APIs:

  1. CLI (command line interface) Tool
  2. Hugging Face transformers-style API
  3. LangChain API
  4. llama-cpp-python-style API

CLI Tool

Currently the bigdl-llm CLI supports the LLaMA (e.g., vicuna), GPT-NeoX (e.g., redpajama), BLOOM (e.g., phoenix) and GPT2 (e.g., starcoder) model architectures; for other models, you may use the transformers-style or LangChain APIs.

  • Convert model

    You may convert the downloaded model into native INT4 format using llm-convert.

    #convert PyTorch (fp16 or fp32) model
    #llama/bloom/gptneox/starcoder model families are currently supported
    llm-convert "/path/to/model/" --model-format pth --model-family "bloom" --outfile "/path/to/output/"
    
    #convert GPTQ-4bit model
    #only llama model family is currently supported
    llm-convert "/path/to/model/" --model-format gptq --model-family "llama" --outfile "/path/to/output/"
    
  • Run model

    You may run the converted model using llm-cli (built on top of main.cpp in llama.cpp).

    #help
    #llama/bloom/gptneox/starcoder model families are currently supported
    llm-cli -x gptneox -h
    
    #text completion
    #llama/bloom/gptneox/starcoder model families are currently supported
    llm-cli -t 16 -x gptneox -m "/path/to/output/model.bin" -p 'Once upon a time,'
    
    #chat mode
    #Note: the chat mode only supports LLaMA (e.g., vicuna) and GPT-NeoX (e.g., redpajama) for now.
    llm-chat -m "/path/to/output/model.bin" -x llama
    

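If you drive these tools from Python scripts, it can help to build the argument lists programmatically. You can then reject model families this README lists as unsupported before launching anything. The sketch below uses only the flags shown above. The support sets are transcribed from this README and may change between releases. The helper names (`convert_cmd`, `completion_cmd`, `chat_cmd`) are hypothetical.

```python
import shlex

# Support matrix as described in this README (may change between releases).
CONVERT_FAMILIES = {
    "pth": {"llama", "bloom", "gptneox", "starcoder"},
    "gptq": {"llama"},
}
CLI_FAMILIES = {"llama", "bloom", "gptneox", "starcoder"}
CHAT_FAMILIES = {"llama", "gptneox"}


def convert_cmd(model_dir, model_format, model_family, outfile):
    """Build an llm-convert argument list, rejecting unsupported combinations."""
    if model_family not in CONVERT_FAMILIES.get(model_format, set()):
        raise ValueError(f"{model_format!r} conversion does not support {model_family!r}")
    return ["llm-convert", model_dir,
            "--model-format", model_format,
            "--model-family", model_family,
            "--outfile", outfile]


def completion_cmd(model_bin, family, prompt, threads=16):
    """Build an llm-cli text-completion argument list."""
    if family not in CLI_FAMILIES:
        raise ValueError(f"llm-cli does not support {family!r}")
    return ["llm-cli", "-t", str(threads), "-x", family, "-m", model_bin, "-p", prompt]


def chat_cmd(model_bin, family):
    """Build an llm-chat argument list (chat mode supports fewer families)."""
    if family not in CHAT_FAMILIES:
        raise ValueError(f"chat mode does not support {family!r}")
    return ["llm-chat", "-m", model_bin, "-x", family]


if __name__ == "__main__":
    cmd = completion_cmd("/path/to/output/model.bin", "gptneox", "Once upon a time,")
    # Print a shell-quoted version for inspection; to run it, use
    # subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Pass the list to `subprocess.run(cmd, check=True)` rather than a single shell string. This avoids quoting problems with prompts that contain spaces or quotes.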
Hugging Face transformers-style API

You may run the models using transformers-style API in bigdl-llm.

  • Using Hugging Face transformers INT4 format

    You may apply INT4 optimizations to any Hugging Face Transformers models as follows.

    #load Hugging Face Transformers model with INT4 optimizations
    from bigdl.llm.transformers import AutoModelForCausalLM
    model_path = '/path/to/model/'
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
    
    #run the optimized model
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    input_ids = tokenizer.encode(input_str, ...)
    output_ids = model.generate(input_ids, ...)
    output = tokenizer.batch_decode(output_ids)
    

    See the complete example here.

  • Using native INT4 format

    You may also convert Hugging Face Transformers models into native INT4 format for maximum performance as follows.

    (Currently only the llama/bloom/gptneox/starcoder model families are supported; for other models, you may use the Transformers INT4 format described above.)

    #convert the model
    from bigdl.llm import llm_convert
    bigdl_llm_path = llm_convert(model='/path/to/model/',
       outfile='/path/to/output/', outtype='int4', model_family="llama")
    
    #load the converted model
    from bigdl.llm.transformers import BigdlForCausalLM
    llm = BigdlForCausalLM.from_pretrained(bigdl_llm_path, ...)
    
    #run the converted model
    input_ids = llm.tokenize(prompt)
    output_ids = llm.generate(input_ids, ...)
    output = llm.batch_decode(output_ids)
    

    See the complete example here.
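Both paths above store weights in 4 bits. The sketch below illustrates the general idea with a generic symmetric, block-wise INT4 scheme. Each block of weights shares one floating-point scale, and each weight is rounded to a signed 4-bit integer. This is an illustration in the spirit of ggml-style 4-bit formats, not bigdl-llm's actual quantization kernels or on-disk layout. The function names and the `block_size=32` default are illustrative assumptions.

```python
# Illustrative symmetric block-wise INT4 quantization (a generic scheme,
# NOT bigdl-llm's actual kernels or file layout).

def quantize_int4(weights, block_size=32):
    """Quantize floats to signed 4-bit ints (range [-8, 7]), one scale per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(x) for x in block)
        # Map the largest magnitude in the block to 7; avoid dividing by zero.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        q = [max(-8, min(7, round(x / scale))) for x in block]
        blocks.append((scale, q))
    return blocks


def dequantize_int4(blocks):
    """Reconstruct approximate floats from (scale, int4 values) blocks."""
    return [scale * q for scale, qs in blocks for q in qs]


if __name__ == "__main__":
    w = [0.12, -0.5, 0.33, 0.07, -0.21, 0.49, -0.02, 0.0]
    blocks = quantize_int4(w, block_size=4)
    w_hat = dequantize_int4(blocks)
    max_err = max(abs(a - b) for a, b in zip(w, w_hat))
    print(blocks)
    print(f"max abs error: {max_err:.4f}")
```

Each value is rounded to the nearest multiple of its block's scale, so the reconstruction error per weight is at most half that scale. Smaller blocks (more scales) therefore trade extra storage for accuracy.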

LangChain API

You may convert Hugging Face Transformers models into native INT4 format (currently only the llama/bloom/gptneox/starcoder model families are supported), and then run the converted models using the LangChain API in bigdl-llm as follows.

from bigdl.llm.langchain.llms import BigdlLLM
from bigdl.llm.langchain.embeddings import BigdlLLMEmbeddings
from langchain.chains.question_answering import load_qa_chain

embeddings = BigdlLLMEmbeddings(model_path='/path/to/converted/model.bin',
                                model_family="llama",...)
bigdl_llm = BigdlLLM(model_path='/path/to/converted/model.bin',
                     model_family="llama",...)

doc_chain = load_qa_chain(bigdl_llm, ...)
doc_chain.run(...)

See the examples here.
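The snippet above constructs `embeddings` but elides how documents reach the QA chain. A common pattern is to embed the documents and the question, then pass only the most similar documents to the chain. LangChain's embeddings interface defines `embed_documents` and `embed_query` for this. The sketch below shows just the ranking step, with toy vectors standing in for real embeddings. The vectors and helper names are hypothetical, and the linked examples may do this differently.

```python
import math

# Rank documents by cosine similarity between their embeddings and the
# query embedding. In a real pipeline the vectors would come from an
# embeddings object; here they are toy values.

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, docs, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    scored = sorted(zip(docs, doc_vecs), key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]


if __name__ == "__main__":
    docs = ["INT4 reduces memory", "Cats sleep a lot", "Quantization speeds up inference"]
    doc_vecs = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.95], [0.8, 0.3, 0.1]]
    query_vec = [1.0, 0.2, 0.0]
    print(top_k(query_vec, doc_vecs, docs))
```

The selected documents would then be what you pass to `doc_chain.run(...)`, keeping the prompt short enough for a local model.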

llama-cpp-python-style API

You may also run the converted models using the llama-cpp-python-style API in bigdl-llm as follows.

from bigdl.llm.models import Llama, Bloom, Gptneox

llm = Bloom("/path/to/converted/model.bin", n_threads=4)
result = llm("what is ai")
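The class you instantiate depends on the model family used at conversion time. Below is a hypothetical helper that picks the class by family name and imports `bigdl.llm.models` lazily. It maps only the three classes imported above. This README does not say whether other families (such as starcoder) have a class in this API, so they are rejected here. Passing `n_threads` to every class is an assumption based on the `Bloom` example.

```python
import importlib

# Map family names (as used by llm-convert/llm-cli) to the classes imported
# in the snippet above. starcoder is omitted: no class for it is listed here.
FAMILY_TO_CLASS = {"llama": "Llama", "bloom": "Bloom", "gptneox": "Gptneox"}


def native_model_class_name(family):
    """Return the bigdl.llm.models class name for a model family."""
    try:
        return FAMILY_TO_CLASS[family]
    except KeyError:
        raise ValueError(f"no llama-cpp-python-style class listed for {family!r}") from None


def load_native_model(path, family, n_threads=4):
    """Hypothetical helper: import bigdl.llm.models lazily and build the right class."""
    module = importlib.import_module("bigdl.llm.models")
    cls = getattr(module, native_model_class_name(family))
    return cls(path, n_threads=n_threads)
```

For example, `llm = load_native_model("/path/to/converted/model.bin", "bloom")` followed by `llm("what is ai")` mirrors the snippet above.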

bigdl-llm Dependencies

The native code/libraries in bigdl-llm have been built using the following tools; in particular, a GLIBC version on your Linux system lower than the one listed below may be incompatible with bigdl-llm.

Model family  Platform  Compiler            GLIBC
llama         Linux     GCC 9.3.1           2.17
llama         Windows   MSVC 19.36.32532.0  -
gptneox       Linux     GCC 9.3.1           2.17
gptneox       Windows   MSVC 19.36.32532.0  -
bloom         Linux     GCC 9.4.0           2.29
bloom         Windows   MSVC 19.36.32532.0  -
starcoder     Linux     GCC 9.4.0           2.29
starcoder     Windows   MSVC 19.36.32532.0  -
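On Linux you can compare your system's GLIBC version against the table before installing. The sketch below uses the standard library's `platform.libc_ver()`, which is best-effort and may return empty strings on systems without glibc. The minimum versions are copied from the table, and the helper names are hypothetical. Versions are compared as integer tuples, because a plain string comparison would wrongly rank "2.3" above "2.17".

```python
import platform

# Minimum GLIBC versions for the Linux builds, copied from the table above.
MIN_GLIBC = {"llama": (2, 17), "gptneox": (2, 17), "bloom": (2, 29), "starcoder": (2, 29)}


def parse_version(text):
    """Turn a dotted version string like '2.31' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def glibc_compatible(family, glibc_version):
    """True if the given glibc version meets the minimum for this model family."""
    return parse_version(glibc_version) >= MIN_GLIBC[family]


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        print("glibc not detected (non-Linux or non-glibc system)")
    else:
        for family in MIN_GLIBC:
            status = "ok" if glibc_compatible(family, version) else "too old"
            print(f"{family}: glibc {version} -> {status}")
```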