Add tokenizer_id in Langchain (#10588)

* fix low-bit

* fix

* fix style

---------

Co-authored-by: arda <arda@arda-arc12.sh.intel.com>
Zhicun 2024-04-03 14:25:35 +08:00 committed by GitHub
parent f6fef09933
commit b827f534d5
5 changed files with 195 additions and 27 deletions


@@ -18,47 +18,47 @@ pip install -U pandas==2.0.3
### Example: Chat
The chat example ([chat.py](./transformers_int4/chat.py)) shows how to use `LLMChain` to build a chat pipeline.
The chat example ([chat.py](./chat.py)) shows how to use `LLMChain` to build a chat pipeline.
To run the example, execute the following command in the current directory:
```bash
python transformers_int4/chat.py -m <path_to_model> [-q <your_question>]
python chat.py -m <path_to_model> [-q <your_question>]
```
> Note: If `-q` is not specified, it will use `What is AI` by default.
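For reference, the core of a chat pipeline like chat.py boils down to wiring a `PromptTemplate` and the IPEX-LLM `TransformersLLM` into an `LLMChain`. The sketch below is illustrative only; the prompt wording and model kwargs are assumptions, not necessarily what chat.py uses:

```python
from ipex_llm.langchain.llms import TransformersLLM
from langchain import PromptTemplate, LLMChain

# Illustrative prompt; the actual chat.py template may differ.
template = "Q: {question}\nA:"
prompt = PromptTemplate(template=template, input_variables=["question"])

# Load the model with IPEX-LLM INT4 optimizations (same API used elsewhere in this commit).
llm = TransformersLLM.from_model_id(
    model_id="<path_to_model>",
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)

llm_chain = LLMChain(prompt=prompt, llm=llm)
print(llm_chain.run("What is AI?"))
```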
### Example: RAG (Retrieval Augmented Generation)
The RAG example ([rag.py](./transformers_int4/rag.py)) shows how to load the input text into a vector database and then use `load_qa_chain` to build a retrieval pipeline.
The RAG example ([rag.py](./rag.py)) shows how to load the input text into a vector database and then use `load_qa_chain` to build a retrieval pipeline.
To run the example, execute the following command in the current directory:
```bash
python transformers_int4/rag.py -m <path_to_model> [-q <your_question>] [-i <path_to_input_txt>]
python rag.py -m <path_to_model> [-q <your_question>] [-i <path_to_input_txt>]
```
> Note: If `-i` is not specified, it will use a short introduction to BigDL as input by default. If `-q` is not specified, `What is IPEX LLM?` will be used by default.
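Conceptually, the retrieval pipeline splits the input text, embeds the chunks into a vector store, retrieves the chunks relevant to the question, and answers over them with `load_qa_chain`. The rough sketch below is only an approximation: the embedding class, the FAISS vector store, and the chunk size are assumptions and may differ from what rag.py actually uses:

```python
from ipex_llm.langchain.llms import TransformersLLM
from langchain.embeddings import HuggingFaceEmbeddings  # assumption: rag.py may use IPEX-LLM's own embeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain

with open("<path_to_input_txt>") as f:
    text = f.read()

# Split the document and index the chunks in a vector store.
chunks = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_text(text)
vectorstore = FAISS.from_texts(chunks, HuggingFaceEmbeddings())

llm = TransformersLLM.from_model_id(
    model_id="<path_to_model>",
    model_kwargs={"temperature": 0, "max_length": 512, "trust_remote_code": True},
)

# Retrieve the relevant chunks and answer the question over them.
question = "What is IPEX LLM?"
docs = vectorstore.similarity_search(question)
chain = load_qa_chain(llm, chain_type="stuff")
print(chain.run(input_documents=docs, question=question))
```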
### Example: Math
The math example ([llm_math.py](./transformers_int4/llm_math.py)) shows how to build a chat pipeline specialized in solving math questions. For example, you can ask `What is 13 raised to the .3432 power?`
The math example ([llm_math.py](./llm_math.py)) shows how to build a chat pipeline specialized in solving math questions. For example, you can ask `What is 13 raised to the .3432 power?`
To run the example, execute the following command in the current directory:
```bash
python transformers_int4/llm_math.py -m <path_to_model> [-q <your_question>]
python llm_math.py -m <path_to_model> [-q <your_question>]
```
> Note: If `-q` is not specified, it will use `What is 13 raised to the .3432 power?` by default.
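Under the hood, a math-specialized chain can be built with LangChain's `LLMMathChain`, which asks the LLM for a numeric expression and then evaluates it. This is a minimal sketch under that assumption; the actual llm_math.py may be structured differently, and `LLMMathChain` requires `numexpr` to be installed:

```python
from ipex_llm.langchain.llms import TransformersLLM
from langchain.chains import LLMMathChain

llm = TransformersLLM.from_model_id(
    model_id="<path_to_model>",
    model_kwargs={"temperature": 0, "max_length": 256, "trust_remote_code": True},
)

# LLMMathChain prompts the model for an expression and evaluates it.
llm_math = LLMMathChain.from_llm(llm, verbose=True)
print(llm_math.run("What is 13 raised to the .3432 power?"))
```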
### Example: Voice Assistant
The voice assistant example ([voiceassistant.py](./transformers_int4/voiceassistant.py)) showcases how to use LangChain to build a pipeline that takes your speech as input in real time, uses an ASR model (e.g. [Whisper-Medium](https://huggingface.co/openai/whisper-medium)) to turn the speech into text, and then feeds the text into a large language model to get a response.
The voice assistant example ([voiceassistant.py](./voiceassistant.py)) showcases how to use LangChain to build a pipeline that takes your speech as input in real time, uses an ASR model (e.g. [Whisper-Medium](https://huggingface.co/openai/whisper-medium)) to turn the speech into text, and then feeds the text into a large language model to get a response.
To run the example, execute the following command in the current directory:
```bash
python transformers_int4/voiceassistant.py -m <path_to_model> [-q <your_question>]
python voiceassistant.py -m <path_to_model> [-q <your_question>]
```
**Runtime Arguments Explained**:
- `-m MODEL_PATH`: **Required**, the path to the
@@ -67,6 +67,23 @@ python transformers_int4/voiceassistant.py -m <path_to_model> [-q <your_question>]
- `-l LANGUAGE`: you can specify a language such as "english" or "chinese"
- `-d True|False`: whether the model path specified in `-m` is a saved low-bit model.
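For a sense of the flow, the pipeline is essentially speech-to-text followed by the LLM chain. The sketch below is a simplification that reads a prerecorded audio file through the Hugging Face ASR pipeline; the real voiceassistant.py captures microphone input in real time and handles the language and low-bit options listed above:

```python
from transformers import pipeline
from ipex_llm.langchain.llms import TransformersLLM
from langchain import PromptTemplate, LLMChain

# Speech-to-text with Whisper (illustrative; voiceassistant.py records from the microphone instead).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-medium")
user_text = asr("<path_to_audio.wav>")["text"]

llm = TransformersLLM.from_model_id(
    model_id="<path_to_model>",
    model_kwargs={"temperature": 0, "max_length": 256, "trust_remote_code": True},
)
llm_chain = LLMChain(
    prompt=PromptTemplate(template="{question}", input_variables=["question"]),
    llm=llm,
)
print(llm_chain.run(user_text))
```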
### Example: Low Bit
The low_bit example ([low_bit.py](./low_bit.py)) showcases how to use LangChain with a low-bit optimized model.
With `save_low_bit`, we save the weights of the low-bit model into the target folder.
> Note: `save_low_bit` only saves the weights of the model.
> Users can either copy the tokenizer files into the target folder or specify `tokenizer_id` during initialization.
```bash
python low_bit.py -m <path_to_model> -t <path_to_target> [-q <your question>]
```
**Runtime Arguments Explained**:
- `-m MODEL_PATH`: **Required**, the path to the model
- `-t TARGET_PATH`: **Required**, the path to save the low-bit model
- `-q QUESTION`: the question to ask. Default is `What is AI?`.
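Condensed from the low_bit.py added in this commit, the key steps are saving the optimized weights and reloading them with `tokenizer_id` pointing back at the original model folder:

```python
from ipex_llm.langchain.llms import TransformersLLM

# Load with INT4 optimizations, then persist only the low-bit weights.
llm = TransformersLLM.from_model_id(
    model_id="<path_to_model>",
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)
llm.model.save_low_bit("<path_to_target>")
del llm

# tokenizer_id points back at the original model folder, since save_low_bit
# does not copy the tokenizer files into the target folder.
low_bit_llm = TransformersLLM.from_model_id_low_bit(
    model_id="<path_to_target>",
    tokenizer_id="<path_to_model>",
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)
```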
### Legacy (Native INT4 examples)
IPEX-LLM also provides LangChain integrations using native INT4 mode. Those examples can be found in the [native_int4](./native_int4/) folder. For detailed instructions on setting up and running the `native_int4` examples, refer to [Native INT4 Examples README](./README_nativeint4.md).


@@ -0,0 +1,60 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import argparse
from ipex_llm.langchain.llms import TransformersLLM, TransformersPipelineLLM
from langchain import PromptTemplate, LLMChain
from langchain import HuggingFacePipeline


def main(args):
    question = args.question
    model_path = args.model_path
    low_bit_model_path = args.target_path

    # Simple pass-through prompt: the question is sent to the model as-is.
    template = """{question}"""
    prompt = PromptTemplate(template=template, input_variables=["question"])

    # Load the original model with IPEX-LLM INT4 optimizations.
    llm = TransformersLLM.from_model_id(
        model_id=model_path,
        model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
    )

    # Save only the low-bit weights to the target folder, then free the full model.
    llm.model.save_low_bit(low_bit_model_path)
    del llm

    # Reload from the low-bit checkpoint; tokenizer_id points back to the
    # original model folder because save_low_bit does not copy tokenizer files.
    low_bit_llm = TransformersLLM.from_model_id_low_bit(
        model_id=low_bit_model_path,
        tokenizer_id=model_path,
        model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True}
    )

    llm_chain = LLMChain(prompt=prompt, llm=low_bit_llm)
    output = llm_chain.run(question)
    print("====output=====")
    print(output)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='TransformersLLM Langchain Chat Example')
    parser.add_argument('-m', '--model-path', type=str, required=True,
                        help='the path to the transformers model')
    parser.add_argument('-t', '--target-path', type=str, required=True,
                        help='the path to save the low-bit model')
    parser.add_argument('-q', '--question', type=str, default='What is AI?',
                        help='question you want to ask.')
    args = parser.parse_args()
    main(args)


@@ -100,4 +100,19 @@ python rag.py -m <path_to_model> [-q QUESTION] [-i INPUT_PATH]
arguments info:
- `-m MODEL_PATH`: **required**, path to the model.
- `-q QUESTION`: question to ask. Default is `What is IPEX?`.
- `-i INPUT_PATH`: path to the input doc.
- `-i INPUT_PATH`: path to the input doc.
#### 5.2. Low Bit
The low_bit example ([low_bit.py](./low_bit.py)) showcases how to use LangChain with a low-bit optimized model.
With `save_low_bit`, we save the weights of the low-bit model into the target folder.
> Note: `save_low_bit` only saves the weights of the model.
> Users can either copy the tokenizer files into the target folder or specify `tokenizer_id` during initialization.
```bash
python low_bit.py -m <path_to_model> -t <path_to_target> [-q <your question>]
```
**Runtime Arguments Explained**:
- `-m MODEL_PATH`: **Required**, the path to the model
- `-t TARGET_PATH`: **Required**, the path to save the low-bit model
- `-q QUESTION`: the question to ask. Default is `What is AI?`.
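As in the CPU variant, the example saves the low-bit weights and reloads them with `tokenizer_id`; the GPU version of low_bit.py (shown below in this commit) additionally passes `device_map='xpu'` so the model runs on the Intel GPU:

```python
low_bit_llm = TransformersLLM.from_model_id_low_bit(
    model_id="<path_to_target>",
    tokenizer_id="<path_to_model>",  # tokenizer comes from the original model folder
    device_map='xpu',                # place the low-bit model on the Intel GPU
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)
```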


@@ -0,0 +1,64 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import argparse
from ipex_llm.langchain.llms import TransformersLLM, TransformersPipelineLLM
from langchain import PromptTemplate, LLMChain
from langchain import HuggingFacePipeline
from torch import device


def main(args):
    question = args.question
    model_path = args.model_path
    low_bit_model_path = args.target_path

    # Simple pass-through prompt: the question is sent to the model as-is.
    template = """{question}"""
    prompt = PromptTemplate(template=template, input_variables=["question"])

    # Load the original model with IPEX-LLM INT4 optimizations on the Intel GPU.
    llm = TransformersLLM.from_model_id(
        model_id=model_path,
        model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
        device_map='xpu'
    )

    # Save only the low-bit weights to the target folder, then free the full model.
    llm.model.save_low_bit(low_bit_model_path)
    del llm

    # Reload from the low-bit checkpoint on the GPU; tokenizer_id points back to
    # the original model folder because save_low_bit does not copy tokenizer files.
    low_bit_llm = TransformersLLM.from_model_id_low_bit(
        model_id=low_bit_model_path,
        tokenizer_id=model_path,
        device_map='xpu',
        model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True}
    )

    llm_chain = LLMChain(prompt=prompt, llm=low_bit_llm)
    output = llm_chain.run(question)
    print("====output=====")
    print(output)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='TransformersLLM Langchain Chat Example')
    parser.add_argument('-m', '--model-path', type=str, required=True,
                        help='the path to the transformers model')
    parser.add_argument('-t', '--target-path', type=str, required=True,
                        help='the path to save the low-bit model')
    parser.add_argument('-q', '--question', type=str, default='What is AI?',
                        help='question you want to ask.')
    args = parser.parse_args()
    main(args)


@@ -48,7 +48,7 @@
import importlib.util
import logging
from typing import Any, List, Mapping, Optional
from ipex_llm.utils.common.log4Error import invalidInputError
from pydantic import Extra
from langchain.callbacks.manager import CallbackManagerForLLMRun
@@ -90,13 +90,14 @@ class TransformersLLM(LLM):
model_id: str,
model_kwargs: Optional[dict] = None,
device_map: str = 'cpu',
tokenizer_id: str = None,
**kwargs: Any,
) -> LLM:
"""
Construct object from model_id
Args:
model_id: Path for the huggingface repo id to be downloaded or
the huggingface checkpoint folder.
model_kwargs: Keyword arguments that will be passed to the model and tokenizer.
@@ -114,21 +115,28 @@ class TransformersLLM(LLM):
from transformers import AutoTokenizer, LlamaTokenizer
except ImportError:
raise ValueError(
invalidInputError(
"Could not import transformers python package. "
"Please install it with `pip install transformers`."
)
_model_kwargs = model_kwargs or {}
# TODO: may refactor this code in the future
try:
tokenizer = AutoTokenizer.from_pretrained(model_id, **_model_kwargs)
except:
tokenizer = LlamaTokenizer.from_pretrained(model_id, **_model_kwargs)
if tokenizer_id is not None:
try:
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id, **_model_kwargs)
except:
tokenizer = LlamaTokenizer.from_pretrained(tokenizer_id, **_model_kwargs)
else:
try:
tokenizer = AutoTokenizer.from_pretrained(model_id, **_model_kwargs)
except:
tokenizer = LlamaTokenizer.from_pretrained(model_id, **_model_kwargs)
# TODO: may refactor this code in the future
try:
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, **_model_kwargs)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True,
**_model_kwargs)
except:
model = AutoModel.from_pretrained(model_id, load_in_4bit=True, **_model_kwargs)
@@ -155,13 +163,12 @@ class TransformersLLM(LLM):
model_id: str,
model_kwargs: Optional[dict] = None,
device_map: str = 'cpu',
tokenizer_id: str = None,
**kwargs: Any,
) -> LLM:
"""
Construct low_bit object from model_id
Args:
model_id: Path for the bigdl transformers low-bit model checkpoint folder.
model_kwargs: Keyword arguments that will be passed to the model and tokenizer.
kwargs: Extra arguments that will be passed to the model and tokenizer.
@@ -177,24 +184,29 @@ class TransformersLLM(LLM):
from transformers import AutoTokenizer, LlamaTokenizer
except ImportError:
raise ValueError(
invalidInputError(
"Could not import transformers python package. "
"Please install it with `pip install transformers`."
)
_model_kwargs = model_kwargs or {}
# TODO: may refactor this code in the future
try:
tokenizer = AutoTokenizer.from_pretrained(model_id, **_model_kwargs)
except:
tokenizer = LlamaTokenizer.from_pretrained(model_id, **_model_kwargs)
if tokenizer_id is not None:
try:
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id, **_model_kwargs)
except:
tokenizer = LlamaTokenizer.from_pretrained(tokenizer_id, **_model_kwargs)
else:
try:
tokenizer = AutoTokenizer.from_pretrained(model_id, **_model_kwargs)
except:
tokenizer = LlamaTokenizer.from_pretrained(model_id, **_model_kwargs)
# TODO: may refactor this code in the future
try:
model = AutoModelForCausalLM.load_low_bit(model_id, **_model_kwargs)
except:
model = AutoModel.load_low_bit(model_id, **_model_kwargs)
# TODO: may refactor this code in the future
if 'xpu' in device_map:
model = model.to(device_map)
@@ -260,5 +272,5 @@ class TransformersLLM(LLM):
else:
stopping_criteria = None
output = self.model.generate(input_ids, stopping_criteria=stopping_criteria, **kwargs)
text = self.tokenizer.decode(output[0], skip_special_tokens=True)[len(prompt) :]
text = self.tokenizer.decode(output[0], skip_special_tokens=True)[len(prompt):]
return text
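
To summarize the new behaviour in `TransformersLLM`: when `tokenizer_id` is omitted, the tokenizer is loaded from `model_id` as before (so the tokenizer files must sit next to the low-bit weights); when it is given, the tokenizer is loaded from that path instead. A short usage sketch:

```python
from ipex_llm.langchain.llms import TransformersLLM

# Tokenizer files live alongside the low-bit checkpoint (previous behaviour).
llm = TransformersLLM.from_model_id_low_bit(model_id="<path_to_low_bit_model>")

# Tokenizer is loaded from the original model folder via the new tokenizer_id argument.
llm = TransformersLLM.from_model_id_low_bit(
    model_id="<path_to_low_bit_model>",
    tokenizer_id="<path_to_original_model>",
)
```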