BigDL LLM

bigdl-llm is an SDK for large language models (LLMs). It helps users develop AI applications that contain LLMs on Intel XPU while using less computing and memory resources. bigdl-llm utilizes a highly optimized GGML on Intel XPU.

Users can use bigdl-llm to

  • Convert their model to lower precision
  • Use a llama.cpp-like command-line tool to run model inference
  • Use a transformers-like API to run model inference
  • Integrate the model into a LangChain pipeline

Currently bigdl-llm supports

  • Precision: INT4
  • Model Family: llama, gptneox, bloom
  • Platform: Ubuntu 20.04 or later, CentOS 7 or later, Windows 10/11
  • Device: CPU
  • Python: 3.9 (recommended) or later

Installation

BigDL-LLM is a self-contained SDK library for model loading and inference. Users can install it directly with

pip install --pre --upgrade bigdl-llm

The model conversion procedure relies on some third-party libraries; add the [all] option during installation to prepare the environment:

pip install --pre --upgrade bigdl-llm[all]
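
As a quick sanity check after installation (a minimal sketch, not an official step), the entry points used later in this README can be imported:

# Minimal import check; these modules are used in the conversion and
# inference sections below.
from bigdl.llm import llm_convert
from bigdl.llm.ggml.transformers import AutoModelForCausalLM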

Usage

A standard procedure for using bigdl-llm involves 3 steps:

  1. Download the model from Hugging Face Hub (a download sketch is shown after this list)
  2. Convert the model from Hugging Face format to GGML format
  3. Run inference using llm-cli, the transformers-like API, or LangChain
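
Step 1 can be done with any Hugging Face download method; a minimal sketch using the huggingface_hub package (the repo id below is only an illustration, matching the example used later in this README) could be:

from huggingface_hub import snapshot_download

# Download the checkpoint into the local Hugging Face cache and return
# the local folder path, which can then be passed to the conversion step.
local_model_path = snapshot_download(repo_id="decapoda-research/llama-7b-hf")
print(local_model_path)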

Convert your model

A Python function and a command-line tool, llm-convert, are provided to transform the model from Hugging Face format to GGML format.

Here is an example of using the llm-convert command-line tool.

# pth model
llm-convert "/path/to/llama-7b-hf/" --model-format pth --outfile "/path/to/llama-7b-int4/" --model-family "llama"
# gptq model
llm-convert "/path/to/vicuna-13B-1.1-GPTQ-4bit-128g/" --model-format gptq --outfile "/path/to/vicuna-13B-int4/" --model-family "llama"

An example GPTQ model can be found here.

Here is an example of using the llm_convert Python API.

from bigdl.llm import llm_convert
# pth model
llm_convert(model="/path/to/llama-7b-hf/",
            outfile="/path/to/llama-7b-int4/",
            model_format="pth",
            model_family="llama")
# gptq model
llm_convert(model="/path/to/vicuna-13B-1.1-GPTQ-4bit-128g.pt",
            outfile="/path/to/out.bin",
            model_format="gptq",
            tokenizer_path="/path/to/tokenizer.model",
            model_family="llama")

Inferencing

llm-cli command line

llm-cli is a command-line interface tool that follows the same interface as the main program in llama.cpp.

# text completion
llm-cli -t 16 -x llama -m "/path/to/llama-7b-int4/bigdl-llm-xxx.bin" -p 'Once upon a time,'

# chatting
llm-cli -t 16 -x llama -m "/path/to/llama-7b-int4/bigdl-llm-xxx.bin" -i --color

# help information
llm-cli -x llama -h

Transformers-like API

Users can load a converted model, or even an unconverted Hugging Face checkpoint, directly with AutoModelForCausalLM.from_pretrained.

from bigdl.llm.ggml.transformers import AutoModelForCausalLM

# option 1: load converted model
llm = AutoModelForCausalLM.from_pretrained("/path/to/llama-7b-int4/bigdl-llm-xxx.bin",
                                           model_family="llama")

# option 2: load huggingface checkpoint
llm = AutoModelForCausalLM.from_pretrained("/path/to/llama-7b-hf/",
                                           model_family="llama")

# option 3: load from huggingface hub repo
llm = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf",
                                           model_family="llama")

Users can then use llm to run inference. Apart from the end-to-end fast forward, the API also supports splitting tokenization and model inference.

# end-to-end fast forward w/o splitting the tokenization and model inference
result = llm("what is ai")

# Use a transformers tokenizer (model_ckpt is the Hugging Face checkpoint path, e.g. "/path/to/llama-7b-hf/")
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
tokens = tokenizer("what is ai").input_ids
tokens_id = llm.generate(tokens, max_new_tokens=32)
tokenizer.batch_decode(tokens_id)

# Use the bigdl-llm tokenizer
tokens = llm.tokenize("what is ai")
tokens_id = llm.generate(tokens, max_new_tokens=32)
decoded = llm.batch_decode(tokens_id)

llama-cpp-python-like API

llama-cpp-python has become a popular Python binding for the llama.cpp program. Some users may be familiar with this API, so bigdl-llm keeps this API and extends it to other model families (e.g., gptneox, bloom); see the sketch after the Llama example below.

from bigdl.llm.models import Llama, Bloom, Gptneox

llm = Llama("/path/to/llama-7b-int4/bigdl-llm-xxx.bin", n_threads=4)
result = llm("what is ai")
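
The same pattern applies to the other supported model families; a minimal sketch for a gptneox model (the converted .bin path below is a placeholder) could be:

from bigdl.llm.models import Gptneox

# Placeholder path to a converted gptneox INT4 GGML file
llm = Gptneox("/path/to/gptneox-7b-int4/bigdl-llm-xxx.bin", n_threads=4)
result = llm("what is ai")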

langchain integration

TODO

Examples

We have prepared several examples at https://github.com/intel-analytics/BigDL/tree/main/python/llm/example

Dynamic library BOM

To avoid difficulties during installation, bigdl-llm releases the C implementation as dynamic libraries or executable files. The compilation details are listed below. This information is for reference only; no compilation is needed on the user side. Note that the GLIBC version may affect compatibility.

Model family   Platform   Compiler             GLIBC
llama          Linux      GCC 9.4.0            2.17
llama          Windows    MSVC 19.36.32532.0
gptneox        Linux      GCC 9.4.0            2.17
gptneox        Windows    MSVC 19.36.32532.0
bloom          Linux      GCC 9.4.0            2.31
bloom          Windows    MSVC 19.36.32532.0