diff --git a/README.md b/README.md
index 1e1add49..ef40fdbe 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@ _**Fast, Distributed, Secure AI for Big Data**_
 ---
 ## Latest News
-- **Try the latest [`bigdl-llm`](python/llm) for running LLM (language language model) on your Intel laptop using INT4 with very low latency!**[^1] *(It is built on top of the excellent work of [llama.cpp](https://github.com/ggerganov/llama.cpp), [gptq](https://github.com/IST-DASLab/gptq), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), etc., and supports any Hugging Face Transformers model)*
+- **Try the latest [`bigdl-llm`](python/llm) for running LLM (large language model) on your Intel laptop using INT4 with very low latency!**[^1] *(It is built on top of the excellent work of [llama.cpp](https://github.com/ggerganov/llama.cpp), [gptq](https://github.com/IST-DASLab/gptq), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), etc., and supports any Hugging Face Transformers model)*

diff --git a/docs/readthedocs/source/doc/PythonAPI/LLM/transformers.rst b/docs/readthedocs/source/doc/PythonAPI/LLM/transformers.rst
index 10438632..519b2d27 100644
--- a/docs/readthedocs/source/doc/PythonAPI/LLM/transformers.rst
+++ b/docs/readthedocs/source/doc/PythonAPI/LLM/transformers.rst
@@ -48,7 +48,45 @@ llm.transformers.model
 llm.transformers.modelling_bigdl
 ----------------------------------------
 
-.. automodule:: bigdl.llm.transformers.modelling_bigdl
+.. autoclass:: bigdl.llm.transformers.modelling_bigdl.LlamaForCausalLM
     :members:
     :undoc-members:
     :show-inheritance:
+    :exclude-members: GGML_Model, GGML_Module, HF_Class
+
+    .. automethod:: from_pretrained
+
+
+.. autoclass:: bigdl.llm.transformers.modelling_bigdl.ChatGLMForCausalLM
+    :members:
+    :undoc-members:
+    :show-inheritance:
+    :exclude-members: GGML_Model, GGML_Module, HF_Class
+
+    .. automethod:: from_pretrained
+
+
+.. autoclass:: bigdl.llm.transformers.modelling_bigdl.GptneoxForCausalLM
+    :members:
+    :undoc-members:
+    :show-inheritance:
+    :exclude-members: GGML_Model, GGML_Module, HF_Class
+
+    .. automethod:: from_pretrained
+
+
+.. autoclass:: bigdl.llm.transformers.modelling_bigdl.BloomForCausalLM
+    :members:
+    :undoc-members:
+    :show-inheritance:
+    :exclude-members: GGML_Model, GGML_Module, HF_Class
+
+    .. automethod:: from_pretrained
+
+.. autoclass:: bigdl.llm.transformers.modelling_bigdl.StarcoderForCausalLM
+    :members:
+    :undoc-members:
+    :show-inheritance:
+    :exclude-members: GGML_Model, GGML_Module, HF_Class
+
+    .. automethod:: from_pretrained
diff --git a/python/llm/README.md b/python/llm/README.md
index 84ee4f2b..34a8b4da 100644
--- a/python/llm/README.md
+++ b/python/llm/README.md
@@ -1,6 +1,6 @@
 ## BigDL-LLM
 
-**`bigdl-llm`** is a library for running ***LLM*** (language language model) on your Intel ***laptop*** using INT4 with very low latency[^1] (for any Hugging Face *Transformers* model).
+**`bigdl-llm`** is a library for running ***LLM*** (large language model) on your Intel ***laptop*** using INT4 with very low latency[^1] (for any Hugging Face *Transformers* model).
 
 >*(It is built on top of the excellent work of [llama.cpp](https://github.com/ggerganov/llama.cpp), [gptq](https://github.com/IST-DASLab/gptq), [ggml](https://github.com/ggerganov/ggml), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), [gptq_for_llama](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [chatglm.cpp](https://github.com/li-plus/chatglm.cpp), [redpajama.cpp](https://github.com/togethercomputer/redpajama.cpp), [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp), [bloomz.cpp](https://github.com/NouamaneTazi/bloomz.cpp/), etc.)*
@@ -76,7 +76,11 @@ You may run the models using `transformers`-style API in `bigdl-llm`.
   #load Hugging Face Transformers model with INT4 optimizations
   from bigdl.llm.transformers import AutoModelForCausalLM
   model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
+  ```
+  After loading the Hugging Face Transformers model, you may easily run the optimized model as follows.
+
+  ```python
   #run the optimized model
   from transformers import AutoTokenizer
   tokenizer = AutoTokenizer.from_pretrained(model_path)
@@ -88,13 +92,14 @@ You may run the models using `transformers`-style API in `bigdl-llm`.
   See the complete examples [here](example/transformers/transformers_int4/).
 
   >**Note**: You may apply more low bit optimizations (including INT8, INT5 and INT4) as follows:
-  >```python
-  >model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
-  >```
-  >See the complete example [here](example/transformers/transformers_low_bit/).
-
+  >```python
+  >model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
+  >```
+  >See the complete example [here](example/transformers/transformers_low_bit/).
-  After the model is optimizaed using INT4 (or INT5/INT8), you may save and load the optimized model as follows:
+
+  After the model is optimized using INT4 (or INT8/INT5), you may save and load the optimized model as follows:
+
   ```python
   model.save_low_bit(model_path)
@@ -106,19 +111,24 @@ You may run the models using `transformers`-style API in `bigdl-llm`.
 
   You may also convert Hugging Face *Transformers* models into native INT4 format for maximum performance as follows.
 
-  >**Note**: Currently only llama/bloom/gptneox/starcoder model family is supported; for other models, you may use the Transformers INT4 format as described above).
+  >**Notes**:
+
+  * Currently only the llama/bloom/gptneox/starcoder/chatglm model families are supported; for other models, you may use the Transformers INT4 format as described above.
+
+  * You may choose the corresponding API developed for specific native models to load the converted model.
 
-  ```python
+  ```python
   #convert the model
   from bigdl.llm import llm_convert
   bigdl_llm_path = llm_convert(model='/path/to/model/', outfile='/path/to/output/', outtype='int4', model_family="llama")
 
   #load the converted model
-  from bigdl.llm.transformers import BigdlNativeForCausalLM
-  llm = BigdlNativeForCausalLM.from_pretrained("/path/to/output/model.bin",...)
-
-  #run the converted model
+  #switch to ChatGLMForCausalLM/GptneoxForCausalLM/BloomForCausalLM/StarcoderForCausalLM to load other models
+  from bigdl.llm.transformers import LlamaForCausalLM
+  llm = LlamaForCausalLM.from_pretrained("/path/to/output/model.bin", ...)
+
+  #run the converted model
   input_ids = llm.tokenize(prompt)
   output_ids = llm.generate(input_ids, ...)
   output = llm.batch_decode(output_ids)
@@ -243,8 +253,9 @@ See the inital `bigdl-llm` API Doc [here](https://bigdl.readthedocs.io/en/latest
 [^1]: Performance varies by use, configuration and other factors. `bigdl-llm` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.
 
-### `bigdl-llm` Dependence
-The native code/lib in `bigdl-llm` has been built using the following tools; in particular, lower `LIBC` version on your Linux system may be incompatible with `bigdl-llm`.
+### `bigdl-llm` Dependencies
+The native code/lib in `bigdl-llm` has been built using the following tools.
+Note that a lower `LIBC` version on your Linux system may be incompatible with `bigdl-llm`.
 
 | Model family | Platform | Compiler           | GLIBC |
 | ------------ | -------- | ------------------ | ----- |
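
The hunk above cuts off right after `model.save_low_bit(model_path)`, so the reload half of the flow is not shown. Below is a minimal sketch of the save/reload cycle the README describes; the paths are hypothetical and the `load_low_bit` call is assumed as the counterpart to `save_low_bit` (it does not appear in this hunk), so treat it as illustrative rather than authoritative.

```python
# Illustrative sketch only: paths are hypothetical, and `load_low_bit` is an
# assumed counterpart to `save_low_bit` that this hunk does not show.
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = '/path/to/model/'            # original Hugging Face checkpoint (hypothetical)
low_bit_path = '/path/to/low-bit-model/'  # where the optimized weights are persisted (hypothetical)

# first run: optimize while loading, then persist the low-bit weights
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit="sym_int5")
model.save_low_bit(low_bit_path)

# later runs: load the already-optimized weights directly,
# without converting the original checkpoint again
model = AutoModelForCausalLM.load_low_bit(low_bit_path)
```
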
diff --git a/python/llm/src/bigdl/llm/transformers/modelling_bigdl.py b/python/llm/src/bigdl/llm/transformers/modelling_bigdl.py
index 098de5e8..4c7ba671 100644
--- a/python/llm/src/bigdl/llm/transformers/modelling_bigdl.py
+++ b/python/llm/src/bigdl/llm/transformers/modelling_bigdl.py
@@ -19,7 +19,9 @@
 # Otherwise there would be module not found error in non-pip's setting as Python would
 # only search the first bigdl package and end up finding only one sub-package.
+import importlib
 import logging
+
 from bigdl.llm.utils.common import invalidInputError
 from .model import *
@@ -107,42 +109,53 @@ class _BaseGGMLClass:
         :return: a model instance
         """
-        if native:
-            invalidInputError(dtype.lower() in ['int4', 'int8'],
-                              "Now we only support int4 and int8 as date type for weight")
-            ggml_model_path = pretrained_model_name_or_path
-            return cls.GGML_Model(model_path=ggml_model_path,
-                                  **kwargs)
-        else:
-            return cls.HF_Class.from_pretrained(pretrained_model_name_or_path,
-                                                *args, **kwargs)
+        try:
+            module = importlib.import_module(cls.GGML_Module)
+            class_ = getattr(module, cls.GGML_Model)
+            if native:
+                invalidInputError(dtype.lower() in ['int4', 'int8'],
+                                  "Now we only support int4 and int8 as data type for weight")
+                ggml_model_path = pretrained_model_name_or_path
+                model = class_(model_path=ggml_model_path, **kwargs)
+            else:
+                model = cls.HF_Class.from_pretrained(pretrained_model_name_or_path,
+                                                     *args, **kwargs)
+        except Exception as e:
+            invalidInputError(
+                False,
+                f"Could not load model from path: {pretrained_model_name_or_path}. "
+                f"Please make sure the CausalLM class matches "
+                "the model you want to load. "
+                f"Received error {e}"
+            )
+        return model
 
 
 class LlamaForCausalLM(_BaseGGMLClass):
-    from bigdl.llm.ggml.model.llama import Llama
-    GGML_Model = Llama
+    GGML_Module = "bigdl.llm.models"
+    GGML_Model = "Llama"
     HF_Class = AutoModelForCausalLM
 
 
 class ChatGLMForCausalLM(_BaseGGMLClass):
-    from bigdl.llm.ggml.model.chatglm import ChatGLM
-    GGML_Model = ChatGLM
+    GGML_Module = "bigdl.llm.ggml.model.chatglm"
+    GGML_Model = "ChatGLM"
     HF_Class = AutoModel
 
 
 class GptneoxForCausalLM(_BaseGGMLClass):
-    from bigdl.llm.ggml.model.gptneox import Gptneox
-    GGML_Model = Gptneox
+    GGML_Module = "bigdl.llm.models"
+    GGML_Model = "Gptneox"
     HF_Class = AutoModelForCausalLM
 
 
 class BloomForCausalLM(_BaseGGMLClass):
-    from bigdl.llm.ggml.model.bloom import Bloom
-    GGML_Model = Bloom
+    GGML_Module = "bigdl.llm.models"
+    GGML_Model = "Bloom"
     HF_Class = AutoModelForCausalLM
 
 
 class StarcoderForCausalLM(_BaseGGMLClass):
-    from bigdl.llm.ggml.model.starcoder import Starcoder
-    GGML_Model = Starcoder
+    GGML_Module = "bigdl.llm.models"
+    GGML_Model = "Starcoder"
     HF_Class = AutoModelForCausalLM
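
To make the refactoring above easier to follow, here is a small standalone sketch of the lazy-loading pattern the new `_BaseGGMLClass` uses: the GGML backend class is looked up by module path and class name only when `from_pretrained` is called, instead of being imported when `modelling_bigdl` itself is imported. The helper name below is hypothetical and exists only for illustration.

```python
# Standalone illustration of the importlib pattern used above; `resolve_ggml_class`
# is a hypothetical helper, not part of bigdl-llm.
import importlib


def resolve_ggml_class(module_name: str, class_name: str):
    """Import `module_name` lazily and return its `class_name` attribute."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Mirrors what LlamaForCausalLM.from_pretrained(..., native=True) now does with
# GGML_Module = "bigdl.llm.models" and GGML_Model = "Llama":
Llama = resolve_ggml_class("bigdl.llm.models", "Llama")
llm = Llama(model_path="/path/to/output/model.bin")  # hypothetical converted-model path
```

Deferring the import this way means a missing or broken native backend only surfaces when that particular model family is actually requested, which is what lets the `try/except` in `from_pretrained` translate the failure into a single `invalidInputError` message.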