Update LangChain examples to use upstream (#12388)

* Update LangChain examples to use upstream

* Update README and fix links

* Update LangChain CPU examples to use upstream

* Update LangChain CPU voice_assistant example

* Update CPU README

* Update GPU README

* Remove GPU Langchain vLLM example and fix comments

* Change langchain -> LangChain

* Add reference for both upstream llms and embeddings

* Fix comments

* Fix comments

* Fix comments

* Fix comments

* Fix comment
Jin, Qiao 2024-11-26 16:43:15 +08:00 committed by GitHub
parent 24b46b2b19
commit c2efa264d9
11 changed files with 331 additions and 296 deletions


@ -1,90 +1,141 @@
## Langchain Examples
# LangChain Example
This folder contains examples showcasing how to use `langchain` with `ipex-llm`.
The examples in this folder show how to use [LangChain](https://www.langchain.com/) with `ipex-llm` on Intel CPU.
### Install IPEX-LLM
> [!NOTE]
> Please refer to the upstream LangChain LLM documentation with ipex-llm [here](https://python.langchain.com/docs/integrations/llms/ipex_llm) and the upstream LangChain embedding documentation with ipex-llm [here](https://python.langchain.com/docs/integrations/text_embedding/ipex_llm/).
Ensure `ipex-llm` is installed by following the [IPEX-LLM Installation Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_cpu.html).
## 0. Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
### Install Dependencies Required by the Examples
## 1. Install
We suggest using conda to manage the environment:
On Linux:
```bash
pip install langchain==0.0.184
pip install -U chromadb==0.3.25
pip install -U pandas==2.0.3
conda create -n llm python=3.11
conda activate llm
# install ipex-llm with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
```
On Windows:
```cmd
conda create -n llm python=3.11
conda activate llm
### Example: Chat
pip install --pre --upgrade ipex-llm[all]
```
The chat example ([chat.py](./chat.py)) shows how to use `LLMChain` to build a chat pipeline.
## 2. Run examples with LangChain
To run the example, execute the following command in the current directory:
### 2.1. Example: Streaming Chat
Install LangChain dependencies:
```bash
python chat.py -m <path_to_model> [-q <your_question>]
pip install -U langchain langchain-community
```
> Note: if `-q` is not specified, it will use `What is AI` by default.
### Example: RAG (Retrieval Augmented Generation)
The RAG example ([rag.py](./rag.py)) shows how to load the input text into a vector database, and then use `load_qa_chain` to build a retrieval pipeline.
To run the example, execute the following command in the current directory:
In the current directory, run the example with the command:
```bash
python rag.py -m <path_to_model> [-q <your_question>] [-i <path_to_input_txt>]
python chat.py -m MODEL_PATH -q QUESTION
```
> Note: If `-i` is not specified, it will use a short introduction to BigDL as input by default. If `-q` is not specified, `What is IPEX LLM?` will be used by default.
**Additional Parameters for Configuration:**
- `-m MODEL_PATH`: **required**, path to the model
- `-q QUESTION`: question to ask. Default is `What is AI?`.
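For reference, the chat pipeline in [chat.py](./chat.py) boils down to roughly the following sketch (the model path and the prompt template are placeholders; the real script adds argument parsing and warning filters):

```python
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import IpexLLM

# Load the model through the upstream LangChain integration;
# ipex-llm applies its low-bit optimizations under the hood.
llm = IpexLLM.from_model_id(
    model_id="MODEL_PATH",  # placeholder: local path to a Hugging Face model
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)

prompt = PromptTemplate(
    template="Question: {question}\nAnswer:",  # stand-in template
    input_variables=["question"],
)
llm_chain = prompt | llm  # LCEL pipeline: prompt -> LLM
print(llm_chain.invoke("What is AI?"))
```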
### 2.2. Example: Retrieval Augmented Generation (RAG)
The RAG example ([rag.py](./rag.py)) shows how to load the input text into a vector database, and then use LangChain to build a retrieval pipeline.
Install LangChain dependencies:
```bash
pip install -U langchain langchain-community langchain-chroma sentence-transformers==3.0.1
```
In the current directory, run the example with the command:
```bash
python rag.py -m <path_to_llm_model> -e <path_to_embedding_model> [-q QUESTION] [-i INPUT_PATH]
```
**Additional Parameters for Configuration:**
- `-m LLM_MODEL_PATH`: **required**, path to the model.
- `-e EMBEDDING_MODEL_PATH`: **required**, path to the embedding model.
- `-q QUESTION`: question to ask. Default is `What is IPEX-LLM?`.
- `-i INPUT_PATH`: path to the input doc.
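At its core, [rag.py](./rag.py) wires the pieces together roughly as sketched below (model paths, splitter settings, and the input text are placeholders; the real script reads the document from `-i`):

```python
from langchain import hub
from langchain_chroma import Chroma
from langchain_community.embeddings import IpexLLMBgeEmbeddings
from langchain_community.llms import IpexLLM
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import CharacterTextSplitter

# Split the input document and index the chunks in a Chroma vector store.
input_doc = "IPEX-LLM is an LLM acceleration library for Intel CPU, GPU and NPU."  # stand-in text
texts = CharacterTextSplitter(chunk_size=650, chunk_overlap=0).split_text(input_doc)
embeddings = IpexLLMBgeEmbeddings(
    model_name="EMBEDDING_MODEL_PATH",  # placeholder: e.g. a local BGE checkpoint
    encode_kwargs={"normalize_embeddings": True},
)
retriever = Chroma.from_texts(texts, embeddings).as_retriever()

llm = IpexLLM.from_model_id(
    model_id="LLM_MODEL_PATH",  # placeholder
    model_kwargs={"temperature": 0, "max_length": 512, "trust_remote_code": True},
)

# Standard LCEL RAG chain: retrieve -> prompt -> LLM -> string.
prompt = hub.pull("rlm/rag-prompt")
rag_chain = (
    {
        "context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)
print(rag_chain.invoke("What is IPEX-LLM?"))
```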
### Example: Math
### 2.3. Example: Low Bit
The low_bit example ([low_bit.py](./low_bit.py)) showcases how to use LangChain with a low-bit optimized model.
By calling `save_low_bit`, we save the weights of the low-bit model into the target folder.
> [!NOTE]
> `save_low_bit` only saves the weights of the model.
> Users could copy the tokenizer model into the target folder or specify `tokenizer_id` during initialization.
Install LangChain dependencies:
```bash
pip install -U langchain langchain-community
```
In the current directory, run the example with the command:
```bash
python low_bit.py -m <path_to_model> -t <path_to_target> [-q <your question>]
```
**Additional Parameters for Configuration:**
- `-m MODEL_PATH`: **Required**, the path to the model
- `-t TARGET_PATH`: **Required**, the path to save the low_bit model
- `-q QUESTION`: question to ask. Default is `What is AI?`.
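Conceptually, [low_bit.py](./low_bit.py) performs two steps, roughly as sketched below (paths and the prompt template are placeholders):

```python
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import IpexLLM

# Step 1: load the original model once and save its low-bit weights.
llm = IpexLLM.from_model_id(
    model_id="MODEL_PATH",  # placeholder: original Hugging Face model folder
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)
llm.model.save_low_bit("TARGET_PATH")  # saves weights only, not the tokenizer
del llm

# Step 2: reload directly from the low-bit weights; point tokenizer_id at the
# original folder (or copy the tokenizer files into TARGET_PATH instead).
llm_lowbit = IpexLLM.from_model_id_low_bit(
    model_id="TARGET_PATH",
    tokenizer_id="MODEL_PATH",
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)

prompt = PromptTemplate(
    template="Question: {question}\nAnswer:",  # stand-in template
    input_variables=["question"],
)
print((prompt | llm_lowbit).invoke("What is AI?"))
```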
### 2.4. Example: Math
The math example ([llm_math.py](./llm_math.py)) shows how to build a chat pipeline specialized in solving math questions. For example, you can ask `What is 13 raised to the .3432 power?`
To run the example, execute the following command in the current directory:
Install LangChain dependencies:
```bash
pip install -U langchain langchain-community
```
In the current directory, run the example with the command:
```bash
python llm_math.py -m <path_to_model> [-q <your_question>]
```
> Note: if `-q` is not specified, it will use `What is 13 raised to the .3432 power?` by default.
**Additional Parameters for Configuration:**
- `-m MODEL_PATH`: **Required**, the path to the model
- `-q QUESTION`: question to ask. Default is `What is 13 raised to the .3432 power?`.
### Example: Voice Assistant
> [!NOTE]
> If `-q` is not specified, it will use `What is 13 raised to the .3432 power?` by default.
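Internally, [llm_math.py](./llm_math.py) hands the loaded model to LangChain's `LLMMathChain`, roughly as sketched below (the model path is a placeholder; `LLMMathChain` additionally requires the `numexpr` package):

```python
from langchain.chains import LLMMathChain
from langchain_community.llms import IpexLLM

llm = IpexLLM.from_model_id(
    model_id="MODEL_PATH",  # placeholder
    model_kwargs={"temperature": 0, "max_length": 1024, "trust_remote_code": True},
)

# LLMMathChain prompts the model to emit a numeric expression,
# then evaluates it with numexpr and returns the answer.
llm_math = LLMMathChain.from_llm(llm, verbose=True)
print(llm_math.invoke("What is 13 raised to the .3432 power?"))
```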
The voice assistant example ([voiceassistant.py](./voiceassistant.py)) showcases how to use langchain to build a pipeline that takes in your speech as input in real time, uses an ASR model (e.g. [Whisper-Medium](https://huggingface.co/openai/whisper-medium)) to turn speech into text, and then feeds the text into a large language model to get a response.
### 2.5. Example: Voice Assistant
The voice assistant example ([voiceassistant.py](./voiceassistant.py)) showcases how to use LangChain to build a pipeline that takes in your speech as input in real time, uses an ASR model (e.g. [Whisper-Medium](https://huggingface.co/openai/whisper-medium)) to turn speech into text, and then feeds the text into a large language model to get a response.
Install LangChain dependencies:
```bash
pip install -U langchain langchain-community
pip install transformers==4.36.2
```
To run the example, execute the following command in the current directory:
```bash
python voiceassistant.py -m <path_to_model> [-q <your_question>]
python voiceassistant.py -m <path_to_model> -r <path_to_recognition_model> [-q <your_question>]
```
**Runtime Arguments Explained**:
**Additional Parameters for Configuration:**
- `-m MODEL_PATH`: **Required**, the path to the model
- `-r RECOGNITION_MODEL_PATH`: **Required**, the path to the Hugging Face speech recognition model
- `-x MAX_NEW_TOKENS`: the maximum number of new tokens to generate
- `-l LANGUAGE`: the language to use, e.g. `english` or `chinese`
- `-d True|False`: whether the model path specified in `-m` points to a saved low-bit model
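Stripped of the microphone handling, the flow in [voiceassistant.py](./voiceassistant.py) is: transcribe a 16 kHz audio buffer with Whisper, then feed the transcript to the LLM. A heavily simplified sketch follows (paths are placeholders, the silent dummy waveform stands in for real microphone input, and the `load_in_4bit` flag is an assumption based on ipex-llm's transformers-style API):

```python
import numpy as np
from transformers import WhisperProcessor
from ipex_llm.transformers import AutoModelForSpeechSeq2Seq
from langchain_community.llms import IpexLLM

# ASR: turn a 16 kHz mono float32 waveform into text.
processor = WhisperProcessor.from_pretrained("RECOGNITION_MODEL_PATH")  # placeholder
recogn_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "RECOGNITION_MODEL_PATH",
    load_in_4bit=True,          # assumed low-bit loading flag
    trust_remote_code=True,
)
audio = np.zeros(16000, dtype=np.float32)  # stand-in for one second of recorded speech
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
forced_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
predicted_ids = recogn_model.generate(input_features, forced_decoder_ids=forced_ids)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

# LLM: answer the transcribed question.
llm = IpexLLM.from_model_id(
    model_id="MODEL_PATH",  # placeholder
    model_kwargs={"temperature": 0, "max_length": 512, "trust_remote_code": True},
)
print(llm.invoke(text))
```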
### Example: Low Bit
The low_bit example ([low_bit.py](./low_bit.py)) showcases how to use langchain with a low-bit optimized model.
By calling `save_low_bit`, we save the weights of the low-bit model into the target folder.
> Note: `save_low_bit` only saves the weights of the model.
> Users could copy the tokenizer model into the target folder or specify `tokenizer_id` during initialization.
```bash
python low_bit.py -m <path_to_model> -t <path_to_target> [-q <your question>]
```
**Runtime Arguments Explained**:
- `-m MODEL_PATH`: **Required**, the path to the model
- `-t TARGET_PATH`: **Required**, the path to save the low_bit model
- `-q QUESTION`: the question
### Legacy (Native INT4 examples)
IPEX-LLM also provides LangChain integrations using native INT4 mode. Those examples can be found in the [native_int4](./native_int4/) folder. For detailed instructions on setting up and running the `native_int4` examples, refer to the [Native INT4 Examples README](./README_nativeint4.md).


@ -20,10 +20,13 @@
# only search the first bigdl package and end up finding only one sub-package.
import argparse
import warnings
from ipex_llm.langchain.llms import TransformersLLM, TransformersPipelineLLM
from langchain import PromptTemplate, LLMChain
from langchain import HuggingFacePipeline
from langchain.chains import LLMChain
from langchain_community.llms import IpexLLM
from langchain_core.prompts import PromptTemplate
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
def main(args):
@ -38,20 +41,18 @@ def main(args):
prompt = PromptTemplate(template=template, input_variables=["question"])
# llm = TransformersPipelineLLM.from_model_id(
# model_id=model_path,
# task="text-generation",
# model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
# )
llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
model_kwargs={
"temperature": 0,
"max_length": 64,
"trust_remote_code": True,
},
)
llm_chain = LLMChain(prompt=prompt, llm=llm)
llm_chain = prompt | llm
output = llm_chain.run(question)
output = llm_chain.invoke(question)
print("====output=====")
print(output)


@ -23,9 +23,12 @@
# Code is adapted from https://python.langchain.com/docs/modules/chains/additional/llm_math
import argparse
import warnings
from langchain.chains import LLMMathChain
from ipex_llm.langchain.llms import TransformersLLM, TransformersPipelineLLM
from langchain_community.llms import IpexLLM
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
def main(args):
@ -33,9 +36,13 @@ def main(args):
question = args.question
model_path = args.model_path
llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 1024, "trust_remote_code": True},
model_kwargs={
"temperature": 0,
"max_length": 1024,
"trust_remote_code": True,
},
)
llm_math = LLMMathChain.from_llm(llm, verbose=True)


@ -16,9 +16,13 @@
import argparse
from ipex_llm.langchain.llms import TransformersLLM, TransformersPipelineLLM
from langchain import PromptTemplate, LLMChain
from langchain import HuggingFacePipeline
import warnings
from langchain.chains import LLMChain
from langchain_community.llms import IpexLLM
from langchain_core.prompts import PromptTemplate
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
def main(args):
@ -29,20 +33,29 @@ def main(args):
prompt = PromptTemplate(template=template, input_variables=["question"])
llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
model_kwargs={
"temperature": 0,
"max_length": 64,
"trust_remote_code": True,
},
)
llm.model.save_low_bit(low_bit_model_path)
del llm
low_bit_llm = TransformersLLM.from_model_id_low_bit(
llm_lowbit = IpexLLM.from_model_id_low_bit(
model_id=low_bit_model_path,
tokenizer_id=model_path,
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True}
# tokenizer_name=saved_lowbit_model_path, # copy the tokenizers to saved path if you want to use it this way
model_kwargs={
"temperature": 0,
"max_length": 64,
"trust_remote_code": True,
},
)
llm_chain = LLMChain(prompt=prompt, llm=low_bit_llm)
llm_chain = prompt | llm_lowbit
output = llm_chain.run(question)
output = llm_chain.invoke(question)
print("====output=====")
print(output)


@ -19,35 +19,31 @@
# Otherwise there would be module not found error in non-pip's setting as Python would
# only search the first bigdl package and end up finding only one sub-package.
# Code is adapted from https://python.langchain.com/docs/modules/chains/additional/question_answering.html
# Code is adapted from https://github.com/langchain-ai/langchain/blob/master/docs/docs/tutorials/rag.ipynb
import argparse
import warnings
from langchain.vectorstores import Chroma
from langchain.chains.chat_vector_db.prompts import (CONDENSE_QUESTION_PROMPT,
QA_PROMPT)
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks.manager import CallbackManager
from langchain import hub
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.embeddings import IpexLLMBgeEmbeddings
from langchain_community.llms import IpexLLM
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_chroma import Chroma
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
from ipex_llm.langchain.llms import TransformersLLM
from ipex_llm.langchain.embeddings import TransformersEmbeddings
text_doc = '''
BigDL seamlessly scales your data analytics & AI applications from laptop to cloud, with the following libraries:
LLM: Low-bit (INT3/INT4/INT5/INT8) large language model library for Intel CPU/GPU
Orca: Distributed Big Data & AI (TF & PyTorch) Pipeline on Spark and Ray
Nano: Transparent Acceleration of Tensorflow & PyTorch Programs on Intel CPU/GPU
DLlib: Equivalent of Spark MLlib for Deep Learning
Chronos: Scalable Time Series Analysis using AutoML
Friesian: End-to-End Recommendation Systems
PPML: Secure Big Data and AI (with SGX Hardware Security)
IPEX-LLM is an LLM acceleration library for Intel CPU, GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) and NPU. It is built on top of the excellent work of llama.cpp, transformers, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc. It provides seamless integration with llama.cpp, Ollama, HuggingFace transformers, LangChain, LlamaIndex, vLLM, Text-Generation-WebUI, DeepSpeed-AutoTP, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModelScope, etc. 70+ models have been optimized/verified on ipex-llm (e.g., Llama, Phi, Mistral, Mixtral, Whisper, Qwen, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art LLM optimizations, XPU acceleration and low-bit (FP8/FP6/FP4/INT4) support.
'''
def main(args):
input_path = args.input_path
model_path = args.model_path
embed_model_path = args.embed_model_path
query = args.question
# split texts of input doc
@ -61,35 +57,45 @@ def main(args):
texts = text_splitter.split_text(input_doc)
# create embeddings and store into vectordb
embeddings = TransformersEmbeddings.from_model_id(
model_id=model_path,
model_kwargs={"trust_remote_code": True}
)
docsearch = Chroma.from_texts(texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]).as_retriever()
embeddings = IpexLLMBgeEmbeddings(
model_name=embed_model_path,
model_kwargs={},
encode_kwargs={"normalize_embeddings": True},
)
retriever = Chroma.from_texts(texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]).as_retriever()
# get relevant texts
docs = docsearch.get_relevant_documents(query)
bigdl_llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 1024, "trust_remote_code": True},
model_kwargs={
"temperature": 0,
"max_length": 512,
"trust_remote_code": True,
},
)
doc_chain = load_qa_chain(
bigdl_llm, chain_type="stuff", prompt=QA_PROMPT
)
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
prompt = hub.pull("rlm/rag-prompt")
output = doc_chain.run(input_documents=docs, question=query)
print(output)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
rag_chain.invoke(query)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='TransformersLLM Langchain QA over Docs Example')
parser.add_argument('-m','--model-path', type=str, required=True,
help='the path to transformers model')
parser.add_argument('-e','--embed-model-path', type=str, required=True,
help='the path to embedding model')
parser.add_argument('-i', '--input-path', type=str,
help='the path to the input doc.')
parser.add_argument('-q', '--question', type=str, default='What is BigDL?',
parser.add_argument('-q', '--question', type=str, default='What is IPEX-LLM?',
help='question you want to ask.')
args = parser.parse_args()


@ -23,16 +23,19 @@
from langchain import LLMChain, PromptTemplate
from ipex_llm.langchain.llms import TransformersLLM
from langchain_community.llms import IpexLLM
from langchain.memory import ConversationBufferWindowMemory
from ipex_llm.transformers import AutoModelForSpeechSeq2Seq
from transformers import WhisperProcessor
import speech_recognition as sr
import numpy as np
import pyttsx3
import argparse
import warnings
import time
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
english_template = """
{history}
Q: {human_input}
@ -47,8 +50,8 @@ template_dict = {
}
llm_load_methods = (
TransformersLLM.from_model_id,
TransformersLLM.from_model_id_low_bit,
IpexLLM.from_model_id,
IpexLLM.from_model_id_low_bit,
)
def prepare_chain(args):
@ -90,7 +93,6 @@ def listen(chain):
voiceassitant_chain, processor, recogn_model, forced_decoder_ids = chain
# engine = pyttsx3.init()
r = sr.Recognizer()
with sr.Microphone(device_index=1, sample_rate=16000) as source:
print("Calibrating...")


@ -1,11 +1,34 @@
# Langchain examples
# LangChain Example
The examples in this folder show how to use [LangChain](https://www.langchain.com/) with `ipex-llm` on Intel GPU.
### 1. Install ipex-llm
Follow the instructions in [GPU Install Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html) to install ipex-llm
> [!NOTE]
> Please refer to the upstream LangChain LLM documentation with ipex-llm [here](https://python.langchain.com/docs/integrations/llms/ipex_llm) and the upstream LangChain embedding documentation with ipex-llm [here](https://python.langchain.com/docs/integrations/text_embedding/ipex_llm_gpu/).
### 2. Configure OneAPI environment variables for Linux
## 0. Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#requirements) for more information.
## 1. Install
### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.11 libuv
conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
## 2. Configure OneAPI environment variables for Linux
> [!NOTE]
> Skip this step if you are running on Windows.
@ -16,9 +39,9 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
source /opt/intel/oneapi/setvars.sh
```
### 3. Runtime Configurations
## 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
@ -55,7 +78,7 @@ export BIGDL_LLM_XMX_DISABLED=1
</details>
#### 3.2 Configurations for Windows
### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
@ -80,105 +103,67 @@ set SYCL_CACHE_PERSISTENT=1
> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Run the examples
## 4. Run examples with LangChain
#### 4.1. Streaming Chat
### 4.1. Example: Streaming Chat
Install dependencies:
Install LangChain dependencies:
```bash
pip install langchain==0.0.184
pip install -U pandas==2.0.3
pip install -U langchain langchain-community
```
Then execute:
In the current directory, run the example with the command:
```bash
python chat.py -m MODEL_PATH -q QUESTION
```
arguments info:
**Additional Parameters for Configuration:**
- `-m MODEL_PATH`: **required**, path to the model
- `-q QUESTION`: question to ask. Default is `What is AI?`.
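The GPU chat pipeline is the same as the CPU one except that `"device": "xpu"` is passed in `model_kwargs`; a minimal sketch (the model path and prompt template are placeholders):

```python
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import IpexLLM

# The extra "device": "xpu" entry moves the optimized model to the Intel GPU.
llm = IpexLLM.from_model_id(
    model_id="MODEL_PATH",  # placeholder
    model_kwargs={
        "temperature": 0,
        "max_length": 64,
        "trust_remote_code": True,
        "device": "xpu",
    },
)
prompt = PromptTemplate(
    template="Question: {question}\nAnswer:",  # stand-in template
    input_variables=["question"],
)
print((prompt | llm).invoke("What is AI?"))
```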
#### 4.2. RAG (Retrieval Augmented Generation)
### 4.2. Example: Retrieval Augmented Generation (RAG)
Install dependencies:
```bash
pip install langchain==0.0.184
pip install -U chromadb==0.3.25
pip install -U pandas==2.0.3
```
The RAG example ([rag.py](./rag.py)) shows how to load the input text into a vector database, and then use LangChain to build a retrieval pipeline.
Then execute:
Install LangChain dependencies:
```bash
python rag.py -m <path_to_model> [-q QUESTION] [-i INPUT_PATH]
pip install -U langchain langchain-community langchain-chroma sentence-transformers==3.0.1
```
arguments info:
- `-m MODEL_PATH`: **required**, path to the model.
- `-q QUESTION`: question to ask. Default is `What is IPEX?`.
In the current directory, run the example with the command:
```bash
python rag.py -m <path_to_llm_model> -e <path_to_embedding_model> [-q QUESTION] [-i INPUT_PATH]
```
**Additional Parameters for Configuration:**
- `-m LLM_MODEL_PATH`: **required**, path to the model.
- `-e EMBEDDING_MODEL_PATH`: **required**, path to the embedding model.
- `-q QUESTION`: question to ask. Default is `What is IPEX-LLM?`.
- `-i INPUT_PATH`: path to the input doc.
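The GPU variant of [rag.py](./rag.py) differs from the CPU one mainly in device placement; a minimal sketch of the two loading calls (paths are placeholders):

```python
from langchain_community.embeddings import IpexLLMBgeEmbeddings
from langchain_community.llms import IpexLLM

# Run both the embedding model and the LLM on the Intel GPU via "device": "xpu".
embeddings = IpexLLMBgeEmbeddings(
    model_name="EMBEDDING_MODEL_PATH",  # placeholder
    model_kwargs={"device": "xpu"},
    encode_kwargs={"normalize_embeddings": True},
)
llm = IpexLLM.from_model_id(
    model_id="LLM_MODEL_PATH",  # placeholder
    model_kwargs={
        "temperature": 0,
        "max_length": 512,
        "trust_remote_code": True,
        "device": "xpu",
    },
)
# The rest of the chain (Chroma retriever, "rlm/rag-prompt", StrOutputParser)
# is identical to the CPU example.
```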
#### 4.3. Low Bit
### 4.3. Example: Low Bit
The low_bit example ([low_bit.py](./low_bit.py)) showcases how to use langchain with a low-bit optimized model.
The low_bit example ([low_bit.py](./low_bit.py)) showcases how to use LangChain with a low-bit optimized model.
By calling `save_low_bit`, we save the weights of the low-bit model into the target folder.
> Note: `save_low_bit` only saves the weights of the model.
> [!NOTE]
> `save_low_bit` only saves the weights of the model.
> Users could copy the tokenizer model into the target folder or specify `tokenizer_id` during initialization.
Install dependencies:
Install LangChain dependencies:
```bash
pip install langchain==0.0.184
pip install -U pandas==2.0.3
pip install -U langchain langchain-community
```
Then execute:
In the current directory, run the example with the command:
```bash
python low_bit.py -m <path_to_model> -t <path_to_target> [-q <your question>]
```
**Runtime Arguments Explained**:
**Additional Parameters for Configuration:**
- `-m MODEL_PATH`: **Required**, the path to the model
- `-t TARGET_PATH`: **Required**, the path to save the low_bit model
- `-q QUESTION`: the question
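As on CPU, the GPU low-bit example saves the weights once and then reloads them; the reload step looks roughly like this (paths are placeholders):

```python
from langchain_community.llms import IpexLLM

# Reload previously saved low-bit weights directly onto the Intel GPU.
llm_lowbit = IpexLLM.from_model_id_low_bit(
    model_id="TARGET_PATH",     # placeholder: folder written by save_low_bit
    tokenizer_id="MODEL_PATH",  # placeholder: original model folder (for the tokenizer)
    model_kwargs={
        "temperature": 0,
        "max_length": 64,
        "trust_remote_code": True,
        "device": "xpu",
    },
)
print(llm_lowbit.invoke("What is AI?"))
```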
#### 4.4 vLLM
The vLLM example ([vllm.py](./vllm.py)) showcases how to use LangChain with the ipex-llm integrated vLLM engine.
Install dependencies:
```bash
pip install "langchain<0.2"
```
Besides, you should also install the IPEX-LLM integrated vLLM according to the instructions listed [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#install-vllm)
**Runtime Arguments Explained**:
- `-m MODEL_PATH`: **Required**, the path to the model
- `-q QUESTION`: the question
- `-t MAX_TOKENS`: max tokens to generate, default 128
- `-p TENSOR_PARALLEL_SIZE`: Use multiple cards for generation
- `-l LOAD_IN_LOW_BIT`: Low bit format for quantization
##### Single card
The following command shows an example on how to execute the example using one card:
```bash
python ./vllm.py -m YOUR_MODEL_PATH -q "What is AI?" -t 128 -p 1 -l sym_int4
```
##### Multi cards
To use the `-p TENSOR_PARALLEL_SIZE` option, you will need to use our docker image: `intelanalytics/ipex-llm-serving-xpu:latest`. For how to use the image, check this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/vllm_docker_quickstart.html#multi-card-serving).
The following command shows an example on how to execute the example using two cards:
```bash
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
python ./vllm.py -m YOUR_MODEL_PATH -q "What is AI?" -t 128 -p 2 -l sym_int4
```
- `-q QUESTION`: question to ask. Default is `What is AI?`.


@ -20,10 +20,13 @@
# only search the first bigdl package and end up finding only one sub-package.
import argparse
import warnings
from ipex_llm.langchain.llms import TransformersLLM, TransformersPipelineLLM
from langchain import PromptTemplate, LLMChain
from langchain import HuggingFacePipeline
from langchain.chains import LLMChain
from langchain_community.llms import IpexLLM
from langchain_core.prompts import PromptTemplate
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
def main(args):
@ -38,22 +41,19 @@ def main(args):
prompt = PromptTemplate(template=template, input_variables=["question"])
# llm = TransformersPipelineLLM.from_model_id(
# model_id=model_path,
# task="text-generation",
# model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
# device_map='xpu'
# )
llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
device_map='xpu'
model_kwargs={
"temperature": 0,
"max_length": 64,
"trust_remote_code": True,
"device": "xpu",
},
)
llm_chain = LLMChain(prompt=prompt, llm=llm)
llm_chain = prompt | llm
output = llm_chain.run(question)
output = llm_chain.invoke(question)
print("====output=====")
print(output)


@ -16,11 +16,13 @@
import argparse
import warnings
from ipex_llm.langchain.llms import TransformersLLM, TransformersPipelineLLM
from langchain import PromptTemplate, LLMChain
from langchain import HuggingFacePipeline
from torch import device
from langchain.chains import LLMChain
from langchain_community.llms import IpexLLM
from langchain_core.prompts import PromptTemplate
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
def main(args):
@ -31,22 +33,31 @@ def main(args):
prompt = PromptTemplate(template=template, input_variables=["question"])
llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
device_map='xpu'
model_kwargs={
"temperature": 0,
"max_length": 64,
"trust_remote_code": True,
"device": "xpu",
},
)
llm.model.save_low_bit(low_bit_model_path)
del llm
low_bit_llm = TransformersLLM.from_model_id_low_bit(
llm_lowbit = IpexLLM.from_model_id_low_bit(
model_id=low_bit_model_path,
tokenizer_id=model_path,
device_map='xpu',
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True}
# tokenizer_name=saved_lowbit_model_path, # copy the tokenizers to saved path if you want to use it this way
model_kwargs={
"temperature": 0,
"max_length": 64,
"trust_remote_code": True,
"device": "xpu",
},
)
llm_chain = LLMChain(prompt=prompt, llm=low_bit_llm)
llm_chain = prompt | llm_lowbit
output = llm_chain.run(question)
output = llm_chain.invoke(question)
print("====output=====")
print(output)


@ -19,36 +19,31 @@
# Otherwise there would be module not found error in non-pip's setting as Python would
# only search the first bigdl package and end up finding only one sub-package.
# Code is adapted from https://python.langchain.com/docs/modules/chains/additional/question_answering.html
# Code is adapted from https://github.com/langchain-ai/langchain/blob/master/docs/docs/tutorials/rag.ipynb
import torch
import argparse
import warnings
from langchain.vectorstores import Chroma
from langchain.chains.chat_vector_db.prompts import (CONDENSE_QUESTION_PROMPT,
QA_PROMPT)
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks.manager import CallbackManager
from langchain import hub
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.embeddings import IpexLLMBgeEmbeddings
from langchain_community.llms import IpexLLM
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_chroma import Chroma
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
from ipex_llm.langchain.llms import TransformersLLM
from ipex_llm.langchain.embeddings import TransformersEmbeddings
text_doc = '''
BigDL seamlessly scales your data analytics & AI applications from laptop to cloud, with the following libraries:
LLM: Low-bit (INT3/INT4/INT5/INT8) large language model library for Intel CPU/GPU
Orca: Distributed Big Data & AI (TF & PyTorch) Pipeline on Spark and Ray
Nano: Transparent Acceleration of Tensorflow & PyTorch Programs on Intel CPU/GPU
DLlib: "Equivalent of Spark MLlib" for Deep Learning
Chronos: Scalable Time Series Analysis using AutoML
Friesian: End-to-End Recommendation Systems
PPML: Secure Big Data and AI (with SGX Hardware Security)
IPEX-LLM is an LLM acceleration library for Intel CPU, GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) and NPU. It is built on top of the excellent work of llama.cpp, transformers, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc. It provides seamless integration with llama.cpp, Ollama, HuggingFace transformers, LangChain, LlamaIndex, vLLM, Text-Generation-WebUI, DeepSpeed-AutoTP, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModelScope, etc. 70+ models have been optimized/verified on ipex-llm (e.g., Llama, Phi, Mistral, Mixtral, Whisper, Qwen, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art LLM optimizations, XPU acceleration and low-bit (FP8/FP6/FP4/INT4) support.
'''
def main(args):
input_path = args.input_path
model_path = args.model_path
embed_model_path = args.embed_model_path
query = args.question
# split texts of input doc
@ -62,37 +57,46 @@ def main(args):
texts = text_splitter.split_text(input_doc)
# create embeddings and store into vectordb
embeddings = TransformersEmbeddings.from_model_id(
model_id=model_path,
model_kwargs={"trust_remote_code": True},
device_map='xpu'
)
docsearch = Chroma.from_texts(texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]).as_retriever()
embeddings = IpexLLMBgeEmbeddings(
model_name=embed_model_path,
model_kwargs={"device": "xpu"},
encode_kwargs={"normalize_embeddings": True},
)
retriever = Chroma.from_texts(texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]).as_retriever()
# get relevant texts
docs = docsearch.get_relevant_documents(query)
bigdl_llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 1024, "trust_remote_code": True},
device_map='xpu'
model_kwargs={
"temperature": 0,
"max_length": 512,
"trust_remote_code": True,
"device": "xpu",
},
)
doc_chain = load_qa_chain(
bigdl_llm, chain_type="stuff", prompt=QA_PROMPT
)
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
prompt = hub.pull("rlm/rag-prompt")
output = doc_chain.run(input_documents=docs, question=query)
print(output)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
rag_chain.invoke(query)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='TransformersLLM Langchain QA over Docs Example')
parser.add_argument('-m','--model-path', type=str, required=True,
help='the path to transformers model')
parser.add_argument('-e','--embed-model-path', type=str, required=True,
help='the path to embedding model')
parser.add_argument('-i', '--input-path', type=str,
help='the path to the input doc.')
parser.add_argument('-q', '--question', type=str, default='What is BigDL?',
parser.add_argument('-q', '--question', type=str, default='What is IPEX-LLM?',
help='question you want to ask.')
args = parser.parse_args()


@ -1,45 +0,0 @@
from ipex_llm.langchain.vllm.vllm import VLLM
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
import argparse
def main(args):
llm = VLLM(
model=args.model_path,
trust_remote_code=True, # mandatory for hf models
max_new_tokens=128,
top_k=10,
top_p=0.95,
temperature=0.8,
max_model_len=2048,
enforce_eager=True,
load_in_low_bit=args.load_in_low_bit,
device="xpu",
tensor_parallel_size=args.tensor_parallel_size,
)
print(llm.invoke(args.question))
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)
llm_chain = LLMChain(prompt=prompt, llm=llm)
print(llm_chain.invoke("Who was the US president in the year the first Pokemon game was released?"))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Langchain integrated vLLM example')
parser.add_argument('-m','--model-path', type=str, required=True,
help='the path to transformers model')
parser.add_argument('-q', '--question', type=str, default='What is the capital of France?', help='question you want to ask.')
parser.add_argument('-t', '--max-tokens', type=int, default=128, help='max tokens to generate')
parser.add_argument('-p', '--tensor-parallel-size', type=int, default=1, help="vLLM tensor parallel size")
parser.add_argument('-l', '--load-in-low-bit', type=str, default='sym_int4', help="low bit format")
args = parser.parse_args()
main(args)