Add LlamaIndex RAG (#10263)

* run demo

* format code

* add llamaindex

* add custom LLM with bigdl

* update

* add readme

* begin ut

* add unit test

* add license

* add license

* revised

* update

* modify docs

* remove data folder

* update

* modify prompt

* fixed

* fixed

* fixed
This commit is contained in:
Zhicun 2024-02-29 15:21:19 +08:00 committed by GitHub
parent 5d7243067c
commit 4e6cc424f1
6 changed files with 898 additions and 0 deletions


@@ -0,0 +1,60 @@
# LlamaIndex Examples
The examples here show how to use LlamaIndex with `bigdl-llm`.
The RAG example is modified from the [demo](https://docs.llamaindex.ai/en/stable/examples/low_level/oss_ingestion_retrieval.html).
## Install bigdl-llm
Follow the instructions in [Install](https://github.com/intel-analytics/BigDL/tree/main/python/llm#install).
## Install Required Dependencies for LlamaIndex Examples
### Install Site-packages
```bash
pip install llama-index-readers-file
pip install llama-index-vector-stores-postgres
pip install llama-index-embeddings-huggingface
```
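Optionally, you can confirm the integrations are importable before continuing. The check below is a minimal sketch that simply imports the three packages installed above (the same modules the example script uses):
```python
# Quick import check for the llama-index integrations used by the RAG example.
from llama_index.readers.file import PyMuPDFReader
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

print("llama-index reader, vector store, and embedding integrations imported successfully")
```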
### Install Postgres
> Note: There are plenty of open-source databases you can use. Here we provide an example using Postgres.
* Download and install postgres by running the commands below.
```bash
sudo apt-get install postgresql-client
sudo apt-get install postgresql
```
* Initialize postgres.
```bash
sudo su - postgres
psql
```
After running these commands you are in the postgres console. Add a role as follows:
```sql
CREATE ROLE <user> WITH LOGIN PASSWORD '<password>';
ALTER ROLE <user> SUPERUSER;
```
* Install pgvector according to the [page](https://github.com/pgvector/pgvector). If you run into problems during installation, the [installation notes](https://github.com/pgvector/pgvector#installation-notes) may be helpful. A short connection and extension check is sketched after this list.
* Download the source document used for retrieval.
```bash
mkdir data
wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
```
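Before moving on, you can optionally verify that the role you created can connect and that the `pgvector` extension is available. This is a minimal sketch, not part of the example itself; it assumes `psycopg2` is installed (the example script imports it) and that Postgres is listening on the default port. Replace `<user>` and `<password>` with the role you created above.
```python
import psycopg2

# Connect with the role created above (placeholder credentials; replace them).
conn = psycopg2.connect(dbname="postgres", host="localhost", port="5432",
                        user="<user>", password="<password>")
conn.autocommit = True
with conn.cursor() as c:
    # This fails if pgvector is not installed correctly.
    c.execute("CREATE EXTENSION IF NOT EXISTS vector")
    c.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector'")
    print("pgvector version:", c.fetchone()[0])
```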
## Run the examples
### Retrieval-augmented Generation
```bash
python rag.py -m MODEL_PATH -e EMBEDDING_MODEL_PATH -u USERNAME -p PASSWORD -q QUESTION -d DATA
```
arguments info:
- `-m MODEL_PATH`: **required**, path to the llama model
- `-e EMBEDDING_MODEL_PATH`: path to the embedding model
- `-u USERNAME`: username in the postgres database
- `-p PASSWORD`: password in the postgres database
- `-q QUESTION`: question you want to ask
- `-d DATA`: path to data used during retrieval
Here is the sample output when using Llama-2-7b-chat-hf as the generation model, asking "How does Llama 2 perform compared to other open-source models?" with llama2.pdf as the retrieval source.
```
Llama 2 performs better than most open-source models on the benchmarks we tested. Specifically, it outperforms all open-source models on MMLU and BBH, and is close to GPT-3.5 on these benchmarks. Additionally, Llama 2 is on par or better than PaLM-2-L on almost all benchmarks. The only exception is the coding benchmarks, where Llama 2 lags significantly behind GPT-4 and PaLM-2-L. Overall, Llama 2 demonstrates strong performance on a wide range of natural language processing tasks.
```


@@ -0,0 +1,247 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from sqlalchemy import make_url
from llama_index.vector_stores.postgres import PGVectorStore
# from llama_index.llms.llama_cpp import LlamaCPP
import psycopg2
from pathlib import Path
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.schema import NodeWithScore
from typing import Optional
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from typing import Any, List
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.vector_stores import VectorStoreQuery
import argparse
def load_vector_database(username, password):
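    """Drop and recreate `example_db`, then return a PGVectorStore backed by
    the local pgvector-enabled Postgres instance for storing node embeddings."""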
db_name = "example_db"
host = "localhost"
password = password
port = "5432"
user = username
# conn = psycopg2.connect(connection_string)
conn = psycopg2.connect(
dbname="postgres",
host=host,
password=password,
port=port,
user=user,
)
conn.autocommit = True
with conn.cursor() as c:
c.execute(f"DROP DATABASE IF EXISTS {db_name}")
c.execute(f"CREATE DATABASE {db_name}")
vector_store = PGVectorStore.from_params(
database=db_name,
host=host,
password=password,
port=port,
user=user,
table_name="llama2_paper",
        embed_dim=384,  # embedding dimension of bge-small-en
)
return vector_store
def load_data(data_path):
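    """Load the PDF, split it into ~1024-token chunks with SentenceSplitter,
    and wrap each chunk in a TextNode carrying the source document's metadata."""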
loader = PyMuPDFReader()
documents = loader.load(file_path=data_path)
text_parser = SentenceSplitter(
chunk_size=1024,
# separator=" ",
)
text_chunks = []
# maintain relationship with source doc index, to help inject doc metadata in (3)
doc_idxs = []
for doc_idx, doc in enumerate(documents):
cur_text_chunks = text_parser.split_text(doc.text)
text_chunks.extend(cur_text_chunks)
doc_idxs.extend([doc_idx] * len(cur_text_chunks))
from llama_index.core.schema import TextNode
nodes = []
for idx, text_chunk in enumerate(text_chunks):
node = TextNode(
text=text_chunk,
)
src_doc = documents[doc_idxs[idx]]
node.metadata = src_doc.metadata
nodes.append(node)
return nodes
class VectorDBRetriever(BaseRetriever):
"""Retriever over a postgres vector store."""
def __init__(
self,
vector_store: PGVectorStore,
embed_model: Any,
query_mode: str = "default",
similarity_top_k: int = 2,
) -> None:
"""Init params."""
self._vector_store = vector_store
self._embed_model = embed_model
self._query_mode = query_mode
self._similarity_top_k = similarity_top_k
super().__init__()
def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
"""Retrieve."""
query_embedding = self._embed_model.get_query_embedding(
query_bundle.query_str
)
vector_store_query = VectorStoreQuery(
query_embedding=query_embedding,
similarity_top_k=self._similarity_top_k,
mode=self._query_mode,
)
query_result = self._vector_store.query(vector_store_query)
nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
score: Optional[float] = None
if query_result.similarities is not None:
score = query_result.similarities[index]
nodes_with_scores.append(NodeWithScore(node=node, score=score))
return nodes_with_scores
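# Wrap a bare completion prompt in the Zephyr-style chat template (empty system prompt)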
def completion_to_prompt(completion):
return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"
# Transform a list of chat messages into zephyr-specific input
def messages_to_prompt(messages):
prompt = ""
for message in messages:
if message.role == "system":
prompt += f"<|system|>\n{message.content}</s>\n"
elif message.role == "user":
prompt += f"<|user|>\n{message.content}</s>\n"
elif message.role == "assistant":
prompt += f"<|assistant|>\n{message.content}</s>\n"
# ensure we start with a system prompt, insert blank if needed
if not prompt.startswith("<|system|>\n"):
prompt = "<|system|>\n</s>\n" + prompt
# add final assistant prompt
prompt = prompt + "<|assistant|>\n"
return prompt
def main(args):
embed_model = HuggingFaceEmbedding(model_name=args.embedding_model_path)
# Use custom LLM in BigDL
from bigdl.llm.llamaindex.llms import BigdlLLM
llm = BigdlLLM(
model_name=args.model_path,
tokenizer_name=args.model_path,
context_window=512,
max_new_tokens=32,
generate_kwargs={"temperature": 0.7, "do_sample": False},
model_kwargs={},
messages_to_prompt=messages_to_prompt,
completion_to_prompt=completion_to_prompt,
device_map="cpu",
)
vector_store = load_vector_database(username=args.user, password=args.password)
nodes = load_data(data_path=args.data)
for node in nodes:
node_embedding = embed_model.get_text_embedding(
node.get_content(metadata_mode="all")
)
node.embedding = node_embedding
vector_store.add(nodes)
# query_str = "Can you tell me about the key concepts for safety finetuning"
query_str = "Explain about the training data for Llama 2"
query_embedding = embed_model.get_query_embedding(query_str)
# construct vector store query
query_mode = "default"
# query_mode = "sparse"
# query_mode = "hybrid"
vector_store_query = VectorStoreQuery(
query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
)
# returns a VectorStoreQueryResult
query_result = vector_store.query(vector_store_query)
# print("Retrieval Results: ")
# print(query_result.nodes[0].get_content())
nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
score: Optional[float] = None
if query_result.similarities is not None:
score = query_result.similarities[index]
nodes_with_scores.append(NodeWithScore(node=node, score=score))
retriever = VectorDBRetriever(
vector_store, embed_model, query_mode="default", similarity_top_k=1
)
query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)
# query_str = "How does Llama 2 perform compared to other open-source models?"
query_str = args.question
response = query_engine.query(query_str)
print("------------RESPONSE GENERATION---------------------")
print(str(response))
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='LlamaIndex BigdlLLM Example')
    parser.add_argument('-m','--model-path', type=str, required=True,
                        help='the path to the transformers model')
    parser.add_argument('-q', '--question', type=str, default='How does Llama 2 perform compared to other open-source models?',
                        help='the question you want to ask')
    parser.add_argument('-d','--data',type=str, default='./data/llama2.pdf',
                        help="the data used during retrieval")
    parser.add_argument('-u', '--user', type=str, required=True,
                        help="the username in the postgres database")
    parser.add_argument('-p','--password', type=str, required=True,
                        help="the password of the user in the database")
    parser.add_argument('-e','--embedding-model-path',default="BAAI/bge-small-en",
                        help="the path to the embedding model")
args = parser.parse_args()
main(args)


@@ -0,0 +1,20 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This makes sure Python is aware there is more than one sub-package within bigdl,
# physically located elsewhere.
# Otherwise there would be a module-not-found error in a non-pip setting, as Python would
# only search the first bigdl package and end up finding only one sub-package.


@@ -0,0 +1,34 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This makes sure Python is aware there is more than one sub-package within bigdl,
# physically located elsewhere.
# Otherwise there would be a module-not-found error in a non-pip setting, as Python would
# only search the first bigdl package and end up finding only one sub-package.
"""Wrappers on top of large language models APIs."""
from typing import Dict, Type
from .bigdlllm import *
from llama_index.core.base.llms.base import BaseLLM
__all__ = [
"BigdlLLM",
]
type_to_cls_dict: Dict[str, Type[BaseLLM]] = {
"BigdlLLM": BigdlLLM,
}


@@ -0,0 +1,449 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# The file is modified from
# https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/base/llms/base.py
# The MIT License
# Copyright (c) Harrison Chase
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
import logging
from threading import Thread
from typing import Any, Callable, Dict, List, Optional, Sequence, Union
import torch
from huggingface_hub import AsyncInferenceClient, InferenceClient, model_info
from llama_index.core.base.llms.types import (
ChatMessage,
ChatResponse,
ChatResponseAsyncGen,
ChatResponseGen,
CompletionResponse,
CompletionResponseAsyncGen,
CompletionResponseGen,
LLMMetadata,
MessageRole,
)
from llama_index.core.bridge.pydantic import Field, PrivateAttr
from llama_index.core.callbacks import CallbackManager
from llama_index.core.constants import (
DEFAULT_CONTEXT_WINDOW,
DEFAULT_NUM_OUTPUTS,
)
from llama_index.core.llms.callbacks import (
llm_chat_callback,
llm_completion_callback,
)
from llama_index.core.llms.custom import CustomLLM
from llama_index.core.base.llms.generic_utils import (
messages_to_prompt as generic_messages_to_prompt,
)
from llama_index.core.prompts.base import PromptTemplate
from llama_index.core.types import BaseOutputParser, PydanticProgramMode
from transformers import (
StoppingCriteria,
StoppingCriteriaList,
)
from transformers import AutoTokenizer, LlamaTokenizer
DEFAULT_HUGGINGFACE_MODEL = "meta-llama/Llama-2-7b-chat-hf"
logger = logging.getLogger(__name__)
class BigdlLLM(CustomLLM):
"""Wrapper around the BigDL-LLM
Example:
.. code-block:: python
from bigdl.llm.llamaindex.llms import BigdlLLM
        llm = BigdlLLM(model_name="/path/to/llama/model", tokenizer_name="/path/to/llama/model")
"""
model_name: str = Field(
default=DEFAULT_HUGGINGFACE_MODEL,
description=(
"The model name to use from HuggingFace. "
"Unused if `model` is passed in directly."
),
)
context_window: int = Field(
default=DEFAULT_CONTEXT_WINDOW,
description="The maximum number of tokens available for input.",
gt=0,
)
max_new_tokens: int = Field(
default=DEFAULT_NUM_OUTPUTS,
description="The maximum number of tokens to generate.",
gt=0,
)
system_prompt: str = Field(
default="",
description=(
"The system prompt, containing any extra instructions or context. "
"The model card on HuggingFace should specify if this is needed."
),
)
query_wrapper_prompt: PromptTemplate = Field(
default=PromptTemplate("{query_str}"),
description=(
"The query wrapper prompt, containing the query placeholder. "
"The model card on HuggingFace should specify if this is needed. "
"Should contain a `{query_str}` placeholder."
),
)
tokenizer_name: str = Field(
default=DEFAULT_HUGGINGFACE_MODEL,
description=(
"The name of the tokenizer to use from HuggingFace. "
"Unused if `tokenizer` is passed in directly."
),
)
device_map: str = Field(
default="auto", description="The device_map to use. Defaults to 'auto'."
)
stopping_ids: List[int] = Field(
default_factory=list,
description=(
"The stopping ids to use. "
"Generation stops when these token IDs are predicted."
),
)
tokenizer_outputs_to_remove: list = Field(
default_factory=list,
description=(
"The outputs to remove from the tokenizer. "
"Sometimes huggingface tokenizers return extra inputs that cause errors."
),
)
tokenizer_kwargs: dict = Field(
default_factory=dict, description="The kwargs to pass to the tokenizer."
)
model_kwargs: dict = Field(
default_factory=dict,
description="The kwargs to pass to the model during initialization.",
)
generate_kwargs: dict = Field(
default_factory=dict,
description="The kwargs to pass to the model during generation.",
)
is_chat_model: bool = Field(
default=False,
description=(
LLMMetadata.__fields__["is_chat_model"].field_info.description
+ " Be sure to verify that you either pass an appropriate tokenizer "
"that can convert prompts to properly formatted chat messages or a "
"`messages_to_prompt` that does so."
),
)
_model: Any = PrivateAttr()
_tokenizer: Any = PrivateAttr()
_stopping_criteria: Any = PrivateAttr()
def __init__(
self,
context_window: int = DEFAULT_CONTEXT_WINDOW,
max_new_tokens: int = DEFAULT_NUM_OUTPUTS,
query_wrapper_prompt: Union[str, PromptTemplate]="{query_str}",
tokenizer_name: str = DEFAULT_HUGGINGFACE_MODEL,
model_name: str = DEFAULT_HUGGINGFACE_MODEL,
model: Optional[Any] = None,
tokenizer: Optional[Any] = None,
device_map: Optional[str] = "auto",
stopping_ids: Optional[List[int]] = None,
tokenizer_kwargs: Optional[dict] = None,
tokenizer_outputs_to_remove: Optional[list] = None,
model_kwargs: Optional[dict] = None,
generate_kwargs: Optional[dict] = None,
is_chat_model: Optional[bool] = False,
callback_manager: Optional[CallbackManager] = None,
system_prompt: str = "",
messages_to_prompt: Optional[Callable[[Sequence[ChatMessage]], str]]=None,
completion_to_prompt: Optional[Callable[[str], str]]=None,
pydantic_program_mode: PydanticProgramMode=PydanticProgramMode.DEFAULT,
output_parser: Optional[BaseOutputParser] = None,
) -> None:
"""
Construct BigdlLLM.
Args:
context_window: The maximum number of tokens available for input.
max_new_tokens: The maximum number of tokens to generate.
query_wrapper_prompt: The query wrapper prompt, containing the query placeholder.
Should contain a `{query_str}` placeholder.
tokenizer_name: The name of the tokenizer to use from HuggingFace.
Unused if `tokenizer` is passed in directly.
model_name: The model name to use from HuggingFace.
Unused if `model` is passed in directly.
model: The HuggingFace model.
tokenizer: The tokenizer.
device_map: The device_map to use. Defaults to 'auto'.
stopping_ids: The stopping ids to use.
Generation stops when these token IDs are predicted.
tokenizer_kwargs: The kwargs to pass to the tokenizer.
tokenizer_outputs_to_remove: The outputs to remove from the tokenizer.
Sometimes huggingface tokenizers return extra inputs that cause errors.
model_kwargs: The kwargs to pass to the model during initialization.
generate_kwargs: The kwargs to pass to the model during generation.
            is_chat_model: Whether the model is a chat model.
callback_manager: Callback manager.
system_prompt: The system prompt, containing any extra instructions or context.
messages_to_prompt: Function to convert messages to prompt.
            completion_to_prompt: Function to convert a completion prompt into the model's prompt format.
pydantic_program_mode: DEFAULT.
output_parser: BaseOutputParser.
Returns:
None.
"""
model_kwargs = model_kwargs or {}
from bigdl.llm.transformers import AutoModelForCausalLM
self._model = model or AutoModelForCausalLM.from_pretrained(
model_name, load_in_4bit=True, **model_kwargs
)
# check context_window
config_dict = self._model.config.to_dict()
model_context_window = int(
config_dict.get("max_position_embeddings", context_window)
)
if model_context_window and model_context_window < context_window:
logger.warning(
f"Supplied context_window {context_window} is greater "
f"than the model's max input size {model_context_window}. "
"Disable this warning by setting a lower context_window."
)
context_window = model_context_window
tokenizer_kwargs = tokenizer_kwargs or {}
if "max_length" not in tokenizer_kwargs:
tokenizer_kwargs["max_length"] = context_window
self._tokenizer = tokenizer or AutoTokenizer.from_pretrained(
tokenizer_name, **tokenizer_kwargs
)
if tokenizer_name != model_name:
logger.warning(
f"The model `{model_name}` and tokenizer `{tokenizer_name}` "
f"are different, please ensure that they are compatible."
)
# setup stopping criteria
stopping_ids_list = stopping_ids or []
class StopOnTokens(StoppingCriteria):
def __call__(
self,
input_ids: torch.LongTensor,
scores: torch.FloatTensor,
**kwargs: Any,
) -> bool:
for stop_id in stopping_ids_list:
if input_ids[0][-1] == stop_id:
return True
return False
self._stopping_criteria = StoppingCriteriaList([StopOnTokens()])
if isinstance(query_wrapper_prompt, str):
query_wrapper_prompt = PromptTemplate(query_wrapper_prompt)
messages_to_prompt = messages_to_prompt or self._tokenizer_messages_to_prompt
super().__init__(
context_window=context_window,
max_new_tokens=max_new_tokens,
query_wrapper_prompt=query_wrapper_prompt,
tokenizer_name=tokenizer_name,
model_name=model_name,
device_map=device_map,
stopping_ids=stopping_ids or [],
tokenizer_kwargs=tokenizer_kwargs or {},
tokenizer_outputs_to_remove=tokenizer_outputs_to_remove or [],
model_kwargs=model_kwargs or {},
generate_kwargs=generate_kwargs or {},
is_chat_model=is_chat_model,
callback_manager=callback_manager,
system_prompt=system_prompt,
messages_to_prompt=messages_to_prompt,
completion_to_prompt=completion_to_prompt,
pydantic_program_mode=pydantic_program_mode,
output_parser=output_parser,
)
@classmethod
def class_name(cls) -> str:
"""
Get class name.
Args:
Returns:
Str of class name.
"""
return "BigDL_LLM"
@property
def metadata(self) -> LLMMetadata:
"""
Get meta data.
Args:
Returns:
            LLMMetadata containing context_window,
            num_output, model_name, and is_chat_model.
"""
return LLMMetadata(
context_window=self.context_window,
num_output=self.max_new_tokens,
model_name=self.model_name,
is_chat_model=self.is_chat_model,
)
def _tokenizer_messages_to_prompt(self, messages: Sequence[ChatMessage]) -> str:
"""
Use the tokenizer to convert messages to prompt. Fallback to generic.
Args:
messages: Sequence of ChatMessage.
Returns:
Str of response.
"""
if hasattr(self._tokenizer, "apply_chat_template"):
messages_dict = [
{"role": message.role.value, "content": message.content}
for message in messages
]
tokens = self._tokenizer.apply_chat_template(messages_dict)
return self._tokenizer.decode(tokens)
return generic_messages_to_prompt(messages)
@llm_completion_callback()
def complete(
self, prompt: str, formatted: bool = False, **kwargs: Any
) -> CompletionResponse:
"""
Complete by LLM.
Args:
prompt: Prompt for completion.
formatted: Whether the prompt is formatted by wrapper.
kwargs: Other kwargs for complete.
Returns:
            CompletionResponse after generation.
"""
full_prompt = prompt
if not formatted:
if self.query_wrapper_prompt:
full_prompt = self.query_wrapper_prompt.format(query_str=prompt)
if self.system_prompt:
full_prompt = f"{self.system_prompt} {full_prompt}"
input_ids = self._tokenizer(full_prompt, return_tensors="pt")
input_ids = input_ids.to(self._model.device)
# remove keys from the tokenizer if needed, to avoid HF errors
for key in self.tokenizer_outputs_to_remove:
if key in input_ids:
input_ids.pop(key, None)
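        # Generate, then strip the prompt tokens so only the completion is decoded.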
tokens = self._model.generate(
**input_ids,
max_new_tokens=self.max_new_tokens,
stopping_criteria=self._stopping_criteria,
**self.generate_kwargs,
)
completion_tokens = tokens[0][input_ids["input_ids"].size(1):]
completion = self._tokenizer.decode(completion_tokens, skip_special_tokens=True)
return CompletionResponse(text=completion, raw={"model_output": tokens})
@llm_completion_callback()
def stream_complete(
self, prompt: str, formatted: bool = False, **kwargs: Any
) -> CompletionResponseGen:
"""
Complete by LLM in stream.
Args:
prompt: Prompt for completion.
formatted: Whether the prompt is formatted by wrapper.
kwargs: Other kwargs for complete.
Returns:
            CompletionResponse after generation.
"""
from transformers import TextStreamer
full_prompt = prompt
if not formatted:
if self.query_wrapper_prompt:
full_prompt = self.query_wrapper_prompt.format(query_str=prompt)
if self.system_prompt:
full_prompt = f"{self.system_prompt} {full_prompt}"
input_ids = self._tokenizer.encode(full_prompt, return_tensors="pt")
input_ids = input_ids.to(self._model.device)
for key in self.tokenizer_outputs_to_remove:
if key in input_ids:
input_ids.pop(key, None)
streamer = TextStreamer(self._tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(
input_ids,
streamer=streamer,
max_new_tokens=self.max_new_tokens,
stopping_criteria=self._stopping_criteria,
**self.generate_kwargs,
)
thread = Thread(target=self._model.generate, kwargs=generation_kwargs)
thread.start()
# create generator based off of streamer
def gen() -> CompletionResponseGen:
text = ""
for x in streamer:
text += x
yield CompletionResponse(text=text, delta=x)
return gen()


@@ -0,0 +1,88 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from bigdl.llm.langchain.llms import TransformersLLM, TransformersPipelineLLM, \
LlamaLLM, BloomLLM
from bigdl.llm.langchain.embeddings import TransformersEmbeddings, LlamaEmbeddings, \
BloomEmbeddings
from langchain.document_loaders import WebBaseLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.chat_vector_db.prompts import (CONDENSE_QUESTION_PROMPT,
QA_PROMPT)
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
import pytest
from unittest import TestCase
import os
from bigdl.llm.llamaindex.llms import BigdlLLM
class Test_LlamaIndex_Transformers_API(TestCase):
def setUp(self):
self.auto_model_path = os.environ.get('ORIGINAL_CHATGLM2_6B_PATH')
self.auto_causal_model_path = os.environ.get('ORIGINAL_REPLIT_CODE_PATH')
self.llama_model_path = os.environ.get('LLAMA_ORIGIN_PATH')
self.bloom_model_path = os.environ.get('BLOOM_ORIGIN_PATH')
thread_num = os.environ.get('THREAD_NUM')
if thread_num is not None:
self.n_threads = int(thread_num)
else:
self.n_threads = 2
    def completion_to_prompt(self, completion):
return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"
    def messages_to_prompt(self, messages):
prompt = ""
for message in messages:
if message.role == "system":
prompt += f"<|system|>\n{message.content}</s>\n"
elif message.role == "user":
prompt += f"<|user|>\n{message.content}</s>\n"
elif message.role == "assistant":
prompt += f"<|assistant|>\n{message.content}</s>\n"
# ensure we start with a system prompt, insert blank if needed
if not prompt.startswith("<|system|>\n"):
prompt = "<|system|>\n</s>\n" + prompt
# add final assistant prompt
prompt = prompt + "<|assistant|>\n"
return prompt
def test_bigdl_llm(self):
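        # Smoke test: load the model through BigdlLLM on CPU and run one short completion.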
llm = BigdlLLM(
model_name=self.llama_model_path,
tokenizer_name=self.llama_model_path,
context_window=512,
max_new_tokens=32,
model_kwargs={},
generate_kwargs={"temperature": 0.7, "do_sample": False},
messages_to_prompt=self.messages_to_prompt,
completion_to_prompt=self.completion_to_prompt,
device_map="cpu",
)
res = llm.complete("What is AI?")
        assert res is not None
if __name__ == '__main__':
pytest.main([__file__])