Update LangChain examples to use upstream (#12388)

* Update LangChain examples to use upstream

* Update README and fix links

* Update LangChain CPU examples to use upstream

* Update LangChain CPU voice_assistant example

* Update CPU README

* Update GPU README

* Remove GPU Langchain vLLM example and fix comments

* Change langchain -> LangChain

* Add reference for both upstream llms and embeddings

* Fix comments

* Fix comments

* Fix comments

* Fix comments

* Fix comment
Jin, Qiao 2024-11-26 16:43:15 +08:00 committed by GitHub
parent 24b46b2b19
commit c2efa264d9
11 changed files with 331 additions and 296 deletions


@ -1,90 +1,141 @@
## Langchain Examples
# LangChain Example
This folder contains examples showcasing how to use `langchain` with `ipex-llm`.
The examples in this folder show how to use [LangChain](https://www.langchain.com/) with `ipex-llm` on Intel CPU.
### Install IPEX-LLM
> [!NOTE]
> Please refer to the upstream LangChain LLM documentation with ipex-llm [here](https://python.langchain.com/docs/integrations/llms/ipex_llm) and the upstream LangChain embedding documentation with ipex-llm [here](https://python.langchain.com/docs/integrations/text_embedding/ipex_llm/).
Ensure `ipex-llm` is installed by following the [IPEX-LLM Installation Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_cpu.html).
## 0. Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
### Install Dependencies Required by the Examples
## 1. Install
We suggest using conda to manage the environment:
On Linux:
```bash
pip install langchain==0.0.184
pip install -U chromadb==0.3.25
pip install -U pandas==2.0.3
conda create -n llm python=3.11
conda activate llm
# install ipex-llm with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
```
On Windows:
```cmd
conda create -n llm python=3.11
conda activate llm
### Example: Chat
pip install --pre --upgrade ipex-llm[all]
```
The chat example ([chat.py](./chat.py)) shows how to use `LLMChain` to build a chat pipeline.
## 2. Run examples with LangChain
To run the example, execute the following command in the current directory:
### 2.1. Example: Streaming Chat
Install LangChain dependencies:
```bash
python chat.py -m <path_to_model> [-q <your_question>]
pip install -U langchain langchain-community
```
> Note: if `-q` is not specified, it will use `What is AI` by default.
### Example: RAG (Retrieval Augmented Generation)
The RAG example ([rag.py](./rag.py)) shows how to load the input text into a vector database, and then use `load_qa_chain` to build a retrieval pipeline.
To run the example, execute the following command in the current directory:
In the current directory, run the example with the command:
```bash
python rag.py -m <path_to_model> [-q <your_question>] [-i <path_to_input_txt>]
python chat.py -m MODEL_PATH -q QUESTION
```
> Note: If `-i` is not specified, it will use a short introduction to BigDL as input by default. If `-q` is not specified, `What is IPEX LLM?` will be used by default.
**Additional Parameters for Configuration:**
- `-m MODEL_PATH`: **required**, path to the model
- `-q QUESTION`: question to ask. Default is `What is AI?`.
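For reference, the chat pipeline in [chat.py](./chat.py) boils down to roughly the following sketch (the model path and the prompt template are placeholders; the real script adds argument parsing and warning filters):

```python
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import IpexLLM

# Load the model through the upstream LangChain integration;
# ipex-llm applies its low-bit optimizations under the hood.
llm = IpexLLM.from_model_id(
    model_id="MODEL_PATH",  # placeholder: local path to a Hugging Face model
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)

prompt = PromptTemplate(
    template="Question: {question}\nAnswer:",  # stand-in template
    input_variables=["question"],
)
llm_chain = prompt | llm  # LCEL pipeline: prompt -> LLM
print(llm_chain.invoke("What is AI?"))
```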
### 2.2. Example: Retrieval Augmented Generation (RAG)
The RAG example ([rag.py](./rag.py)) shows how to load the input text into a vector database, and then use LangChain to build a retrieval pipeline.
Install LangChain dependencies:
```bash
pip install -U langchain langchain-community langchain-chroma sentence-transformers==3.0.1
```
In the current directory, run the example with the command:
```bash
python rag.py -m <path_to_llm_model> -e <path_to_embedding_model> [-q QUESTION] [-i INPUT_PATH]
```
**Additional Parameters for Configuration:**
- `-m LLM_MODEL_PATH`: **required**, path to the model.
- `-e EMBEDDING_MODEL_PATH`: **required**, path to the embedding model.
- `-q QUESTION`: question to ask. Default is `What is IPEX-LLM?`.
- `-i INPUT_PATH`: path to the input doc.
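At its core, [rag.py](./rag.py) wires the pieces together roughly as sketched below (model paths, splitter settings, and the input text are placeholders; the real script reads the document from `-i`):

```python
from langchain import hub
from langchain_chroma import Chroma
from langchain_community.embeddings import IpexLLMBgeEmbeddings
from langchain_community.llms import IpexLLM
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import CharacterTextSplitter

# Split the input document and index the chunks in a Chroma vector store.
input_doc = "IPEX-LLM is an LLM acceleration library for Intel CPU, GPU and NPU."  # stand-in text
texts = CharacterTextSplitter(chunk_size=650, chunk_overlap=0).split_text(input_doc)
embeddings = IpexLLMBgeEmbeddings(
    model_name="EMBEDDING_MODEL_PATH",  # placeholder: e.g. a local BGE checkpoint
    encode_kwargs={"normalize_embeddings": True},
)
retriever = Chroma.from_texts(texts, embeddings).as_retriever()

llm = IpexLLM.from_model_id(
    model_id="LLM_MODEL_PATH",  # placeholder
    model_kwargs={"temperature": 0, "max_length": 512, "trust_remote_code": True},
)

# Standard LCEL RAG chain: retrieve -> prompt -> LLM -> string.
prompt = hub.pull("rlm/rag-prompt")
rag_chain = (
    {
        "context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)
print(rag_chain.invoke("What is IPEX-LLM?"))
```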
### Example: Math
### 2.3. Example: Low Bit
The low_bit example ([low_bit.py](./low_bit.py)) showcases how to use LangChain with a low-bit optimized model.
By calling `save_low_bit`, we save the weights of the low-bit model into the target folder.
> [!NOTE]
> `save_low_bit` only saves the weights of the model.
> Users could copy the tokenizer model into the target folder or specify `tokenizer_id` during initialization.
Install LangChain dependencies:
```bash
pip install -U langchain langchain-community
```
In the current directory, run the example with the command:
```bash
python low_bit.py -m <path_to_model> -t <path_to_target> [-q <your question>]
```
**Additional Parameters for Configuration:**
- `-m MODEL_PATH`: **Required**, the path to the model
- `-t TARGET_PATH`: **Required**, the path to save the low_bit model
- `-q QUESTION`: question to ask. Default is `What is AI?`.
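Conceptually, [low_bit.py](./low_bit.py) performs two steps, roughly as sketched below (paths and the prompt template are placeholders):

```python
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import IpexLLM

# Step 1: load the original model once and save its low-bit weights.
llm = IpexLLM.from_model_id(
    model_id="MODEL_PATH",  # placeholder: original Hugging Face model folder
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)
llm.model.save_low_bit("TARGET_PATH")  # saves weights only, not the tokenizer
del llm

# Step 2: reload directly from the low-bit weights; point tokenizer_id at the
# original folder (or copy the tokenizer files into TARGET_PATH instead).
llm_lowbit = IpexLLM.from_model_id_low_bit(
    model_id="TARGET_PATH",
    tokenizer_id="MODEL_PATH",
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)

prompt = PromptTemplate(
    template="Question: {question}\nAnswer:",  # stand-in template
    input_variables=["question"],
)
print((prompt | llm_lowbit).invoke("What is AI?"))
```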
### 2.4. Example: Math
The math example ([llm_math.py](./llm_math.py)) shows how to build a chat pipeline specialized in solving math questions. For example, you can ask `What is 13 raised to the .3432 power?`
To run the example, execute the following command in the current directory:
Install LangChain dependencies:
```bash
pip install -U langchain langchain-community
```
In the current directory, run the example with the command:
```bash
python llm_math.py -m <path_to_model> [-q <your_question>]
```
> Note: if `-q` is not specified, it will use `What is 13 raised to the .3432 power?` by default.
**Additional Parameters for Configuration:**
- `-m MODEL_PATH`: **Required**, the path to the model
- `-q QUESTION`: question to ask. Default is `What is 13 raised to the .3432 power?`.
### Example: Voice Assistant
> [!NOTE]
> If `-q` is not specified, it will use `What is 13 raised to the .3432 power?` by default.
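Internally, [llm_math.py](./llm_math.py) hands the loaded model to LangChain's `LLMMathChain`, roughly as sketched below (the model path is a placeholder; `LLMMathChain` additionally requires the `numexpr` package):

```python
from langchain.chains import LLMMathChain
from langchain_community.llms import IpexLLM

llm = IpexLLM.from_model_id(
    model_id="MODEL_PATH",  # placeholder
    model_kwargs={"temperature": 0, "max_length": 1024, "trust_remote_code": True},
)

# LLMMathChain prompts the model to emit a numeric expression,
# then evaluates it with numexpr and returns the answer.
llm_math = LLMMathChain.from_llm(llm, verbose=True)
print(llm_math.invoke("What is 13 raised to the .3432 power?"))
```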
The voice assistant example ([voiceassistant.py](./voiceassistant.py)) showcases how to use langchain to build a pipeline that takes in your speech as input in real time, uses an ASR model (e.g. [Whisper-Medium](https://huggingface.co/openai/whisper-medium)) to turn speech into text, and then feeds the text into a large language model to get a response.
### 2.5. Example: Voice Assistant
The voice assistant example ([voiceassistant.py](./voiceassistant.py)) showcases how to use LangChain to build a pipeline that takes in your speech as input in real time, uses an ASR model (e.g. [Whisper-Medium](https://huggingface.co/openai/whisper-medium)) to turn speech into text, and then feeds the text into a large language model to get a response.
Install LangChain dependencies:
```bash
pip install -U langchain langchain-community
pip install transformers==4.36.2
```
To run the example, execute the following command in the current directory:
```bash
python voiceassistant.py -m <path_to_model> [-q <your_question>]
python voiceassistant.py -m <path_to_model> -r <path_to_recognition_model> [-q <your_question>]
```
**Runtime Arguments Explained**:
**Additional Parameters for Configuration:**
- `-m MODEL_PATH`: **Required**, the path to the model
- `-r RECOGNITION_MODEL_PATH`: **Required**, the path to the Hugging Face speech recognition model
- `-x MAX_NEW_TOKENS`: the maximum number of new tokens to generate
- `-l LANGUAGE`: the language to use, e.g. `english` or `chinese`
- `-d True|False`: whether the model path specified in `-m` points to a saved low-bit model
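Stripped of the microphone handling, the flow in [voiceassistant.py](./voiceassistant.py) is: transcribe a 16 kHz audio buffer with Whisper, then feed the transcript to the LLM. A heavily simplified sketch follows (paths are placeholders, the silent dummy waveform stands in for real microphone input, and the `load_in_4bit` flag is an assumption based on ipex-llm's transformers-style API):

```python
import numpy as np
from transformers import WhisperProcessor
from ipex_llm.transformers import AutoModelForSpeechSeq2Seq
from langchain_community.llms import IpexLLM

# ASR: turn a 16 kHz mono float32 waveform into text.
processor = WhisperProcessor.from_pretrained("RECOGNITION_MODEL_PATH")  # placeholder
recogn_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "RECOGNITION_MODEL_PATH",
    load_in_4bit=True,          # assumed low-bit loading flag
    trust_remote_code=True,
)
audio = np.zeros(16000, dtype=np.float32)  # stand-in for one second of recorded speech
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
forced_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
predicted_ids = recogn_model.generate(input_features, forced_decoder_ids=forced_ids)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

# LLM: answer the transcribed question.
llm = IpexLLM.from_model_id(
    model_id="MODEL_PATH",  # placeholder
    model_kwargs={"temperature": 0, "max_length": 512, "trust_remote_code": True},
)
print(llm.invoke(text))
```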
### Example: Low Bit
The low_bit example ([low_bit.py](./low_bit.py)) showcases how to use langchain with a low-bit optimized model.
By calling `save_low_bit`, we save the weights of the low-bit model into the target folder.
> Note: `save_low_bit` only saves the weights of the model.
> Users could copy the tokenizer model into the target folder or specify `tokenizer_id` during initialization.
```bash
python low_bit.py -m <path_to_model> -t <path_to_target> [-q <your question>]
```
**Runtime Arguments Explained**:
- `-m MODEL_PATH`: **Required**, the path to the model
- `-t TARGET_PATH`: **Required**, the path to save the low_bit model
- `-q QUESTION`: the question
### Legacy (Native INT4 examples)
IPEX-LLM also provides LangChain integrations using native INT4 mode. Those examples can be found in the [native_int4](./native_int4/) folder. For detailed instructions on setting up and running the `native_int4` examples, refer to the [Native INT4 Examples README](./README_nativeint4.md).


@ -20,10 +20,13 @@
# only search the first bigdl package and end up finding only one sub-package.
import argparse
import warnings
from ipex_llm.langchain.llms import TransformersLLM, TransformersPipelineLLM
from langchain import PromptTemplate, LLMChain
from langchain import HuggingFacePipeline
from langchain.chains import LLMChain
from langchain_community.llms import IpexLLM
from langchain_core.prompts import PromptTemplate
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
def main(args):
@ -38,20 +41,18 @@ def main(args):
prompt = PromptTemplate(template=template, input_variables=["question"])
# llm = TransformersPipelineLLM.from_model_id(
# model_id=model_path,
# task="text-generation",
# model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
# )
llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
model_kwargs={
"temperature": 0,
"max_length": 64,
"trust_remote_code": True,
},
)
llm_chain = LLMChain(prompt=prompt, llm=llm)
llm_chain = prompt | llm
output = llm_chain.run(question)
output = llm_chain.invoke(question)
print("====output=====")
print(output)


@ -23,9 +23,12 @@
# Code is adapted from https://python.langchain.com/docs/modules/chains/additional/llm_math
import argparse
import warnings
from langchain.chains import LLMMathChain
from ipex_llm.langchain.llms import TransformersLLM, TransformersPipelineLLM
from langchain_community.llms import IpexLLM
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
def main(args):
@ -33,9 +36,13 @@ def main(args):
question = args.question
model_path = args.model_path
llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 1024, "trust_remote_code": True},
model_kwargs={
"temperature": 0,
"max_length": 1024,
"trust_remote_code": True,
},
)
llm_math = LLMMathChain.from_llm(llm, verbose=True)


@ -16,9 +16,13 @@
import argparse
from ipex_llm.langchain.llms import TransformersLLM, TransformersPipelineLLM
from langchain import PromptTemplate, LLMChain
from langchain import HuggingFacePipeline
import warnings
from langchain.chains import LLMChain
from langchain_community.llms import IpexLLM
from langchain_core.prompts import PromptTemplate
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
def main(args):
@ -29,20 +33,29 @@ def main(args):
prompt = PromptTemplate(template=template, input_variables=["question"])
llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
model_kwargs={
"temperature": 0,
"max_length": 64,
"trust_remote_code": True,
},
)
llm.model.save_low_bit(low_bit_model_path)
del llm
low_bit_llm = TransformersLLM.from_model_id_low_bit(
llm_lowbit = IpexLLM.from_model_id_low_bit(
model_id=low_bit_model_path,
tokenizer_id=model_path,
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True}
# tokenizer_name=saved_lowbit_model_path, # copy the tokenizers to saved path if you want to use it this way
model_kwargs={
"temperature": 0,
"max_length": 64,
"trust_remote_code": True,
},
)
llm_chain = LLMChain(prompt=prompt, llm=low_bit_llm)
llm_chain = prompt | llm_lowbit
output = llm_chain.run(question)
output = llm_chain.invoke(question)
print("====output=====")
print(output)


@ -19,35 +19,31 @@
# Otherwise there would be module not found error in non-pip's setting as Python would
# only search the first bigdl package and end up finding only one sub-package.
# Code is adapted from https://python.langchain.com/docs/modules/chains/additional/question_answering.html
# Code is adapted from https://github.com/langchain-ai/langchain/blob/master/docs/docs/tutorials/rag.ipynb
import argparse
import warnings
from langchain.vectorstores import Chroma
from langchain.chains.chat_vector_db.prompts import (CONDENSE_QUESTION_PROMPT,
QA_PROMPT)
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks.manager import CallbackManager
from langchain import hub
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.embeddings import IpexLLMBgeEmbeddings
from langchain_community.llms import IpexLLM
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_chroma import Chroma
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
from ipex_llm.langchain.llms import TransformersLLM
from ipex_llm.langchain.embeddings import TransformersEmbeddings
text_doc = '''
BigDL seamlessly scales your data analytics & AI applications from laptop to cloud, with the following libraries:
LLM: Low-bit (INT3/INT4/INT5/INT8) large language model library for Intel CPU/GPU
Orca: Distributed Big Data & AI (TF & PyTorch) Pipeline on Spark and Ray
Nano: Transparent Acceleration of Tensorflow & PyTorch Programs on Intel CPU/GPU
DLlib: Equivalent of Spark MLlib for Deep Learning
Chronos: Scalable Time Series Analysis using AutoML
Friesian: End-to-End Recommendation Systems
PPML: Secure Big Data and AI (with SGX Hardware Security)
IPEX-LLM is an LLM acceleration library for Intel CPU, GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) and NPU. It is built on top of the excellent work of llama.cpp, transformers, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc. It provides seamless integration with llama.cpp, Ollama, HuggingFace transformers, LangChain, LlamaIndex, vLLM, Text-Generation-WebUI, DeepSpeed-AutoTP, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModelScope, etc. 70+ models have been optimized/verified on ipex-llm (e.g., Llama, Phi, Mistral, Mixtral, Whisper, Qwen, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art LLM optimizations, XPU acceleration and low-bit (FP8/FP6/FP4/INT4) support.
'''
def main(args):
input_path = args.input_path
model_path = args.model_path
embed_model_path = args.embed_model_path
query = args.question
# split texts of input doc
@ -61,35 +57,45 @@ def main(args):
texts = text_splitter.split_text(input_doc)
# create embeddings and store into vectordb
embeddings = TransformersEmbeddings.from_model_id(
model_id=model_path,
model_kwargs={"trust_remote_code": True}
)
docsearch = Chroma.from_texts(texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]).as_retriever()
embeddings = IpexLLMBgeEmbeddings(
model_name=embed_model_path,
model_kwargs={},
encode_kwargs={"normalize_embeddings": True},
)
retriever = Chroma.from_texts(texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]).as_retriever()
# get relevant texts
docs = docsearch.get_relevant_documents(query)
bigdl_llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 1024, "trust_remote_code": True},
model_kwargs={
"temperature": 0,
"max_length": 512,
"trust_remote_code": True,
},
)
doc_chain = load_qa_chain(
bigdl_llm, chain_type="stuff", prompt=QA_PROMPT
)
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
prompt = hub.pull("rlm/rag-prompt")
output = doc_chain.run(input_documents=docs, question=query)
print(output)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
rag_chain.invoke(query)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='TransformersLLM Langchain QA over Docs Example')
parser.add_argument('-m','--model-path', type=str, required=True,
help='the path to transformers model')
parser.add_argument('-e','--embed-model-path', type=str, required=True,
help='the path to embedding model')
parser.add_argument('-i', '--input-path', type=str,
help='the path to the input doc.')
parser.add_argument('-q', '--question', type=str, default='What is BigDL?',
parser.add_argument('-q', '--question', type=str, default='What is IPEX-LLM?',
help='question you want to ask.')
args = parser.parse_args()


@ -23,16 +23,19 @@
from langchain import LLMChain, PromptTemplate
from ipex_llm.langchain.llms import TransformersLLM
from langchain_community.llms import IpexLLM
from langchain.memory import ConversationBufferWindowMemory
from ipex_llm.transformers import AutoModelForSpeechSeq2Seq
from transformers import WhisperProcessor
import speech_recognition as sr
import numpy as np
import pyttsx3
import argparse
import warnings
import time
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
english_template = """
{history}
Q: {human_input}
@ -47,8 +50,8 @@ template_dict = {
}
llm_load_methods = (
TransformersLLM.from_model_id,
TransformersLLM.from_model_id_low_bit,
IpexLLM.from_model_id,
IpexLLM.from_model_id_low_bit,
)
def prepare_chain(args):
@ -90,7 +93,6 @@ def listen(chain):
voiceassitant_chain, processor, recogn_model, forced_decoder_ids = chain
# engine = pyttsx3.init()
r = sr.Recognizer()
with sr.Microphone(device_index=1, sample_rate=16000) as source:
print("Calibrating...")


@ -1,11 +1,34 @@
# Langchain examples
# LangChain Example
The examples in this folder show how to use [LangChain](https://www.langchain.com/) with `ipex-llm` on Intel GPU.
### 1. Install ipex-llm
Follow the instructions in [GPU Install Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html) to install ipex-llm
> [!NOTE]
> Please refer to the upstream LangChain LLM documentation with ipex-llm [here](https://python.langchain.com/docs/integrations/llms/ipex_llm) and the upstream LangChain embedding documentation with ipex-llm [here](https://python.langchain.com/docs/integrations/text_embedding/ipex_llm_gpu/).
### 2. Configure OneAPI environment variables for Linux
## 0. Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#requirements) for more information.
## 1. Install
### 1.1 Installation on Linux
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.11 libuv
conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
## 2. Configure OneAPI environment variables for Linux
> [!NOTE]
> Skip this step if you are running on Windows.
@ -16,9 +39,9 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
source /opt/intel/oneapi/setvars.sh
```
### 3. Runtime Configurations
## 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
@ -55,7 +78,7 @@ export BIGDL_LLM_XMX_DISABLED=1
</details>
#### 3.2 Configurations for Windows
### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
@ -80,105 +103,67 @@ set SYCL_CACHE_PERSISTENT=1
> [!NOTE]
> For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
### 4. Run the examples
## 4. Run examples with LangChain
#### 4.1. Streaming Chat
### 4.1. Example: Streaming Chat
Install dependencies:
Install LangChain dependencies:
```bash
pip install langchain==0.0.184
pip install -U pandas==2.0.3
pip install -U langchain langchain-community
```
Then execute:
In the current directory, run the example with the command:
```bash
python chat.py -m MODEL_PATH -q QUESTION
```
arguments info:
**Additional Parameters for Configuration:**
- `-m MODEL_PATH`: **required**, path to the model
- `-q QUESTION`: question to ask. Default is `What is AI?`.
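The GPU chat pipeline is the same as the CPU one except that `"device": "xpu"` is passed in `model_kwargs`; a minimal sketch (the model path and prompt template are placeholders):

```python
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import IpexLLM

# The extra "device": "xpu" entry moves the optimized model to the Intel GPU.
llm = IpexLLM.from_model_id(
    model_id="MODEL_PATH",  # placeholder
    model_kwargs={
        "temperature": 0,
        "max_length": 64,
        "trust_remote_code": True,
        "device": "xpu",
    },
)
prompt = PromptTemplate(
    template="Question: {question}\nAnswer:",  # stand-in template
    input_variables=["question"],
)
print((prompt | llm).invoke("What is AI?"))
```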
#### 4.2. RAG (Retrieval Augmented Generation)
### 4.2. Example: Retrieval Augmented Generation (RAG)
Install dependencies:
```bash
pip install langchain==0.0.184
pip install -U chromadb==0.3.25
pip install -U pandas==2.0.3
```
The RAG example ([rag.py](./rag.py)) shows how to load the input text into a vector database, and then use LangChain to build a retrieval pipeline.
Then execute:
Install LangChain dependencies:
```bash
python rag.py -m <path_to_model> [-q QUESTION] [-i INPUT_PATH]
pip install -U langchain langchain-community langchain-chroma sentence-transformers==3.0.1
```
arguments info:
- `-m MODEL_PATH`: **required**, path to the model.
- `-q QUESTION`: question to ask. Default is `What is IPEX?`.
In the current directory, run the example with the command:
```bash
python rag.py -m <path_to_llm_model> -e <path_to_embedding_model> [-q QUESTION] [-i INPUT_PATH]
```
**Additional Parameters for Configuration:**
- `-m LLM_MODEL_PATH`: **required**, path to the model.
- `-e EMBEDDING_MODEL_PATH`: **required**, path to the embedding model.
- `-q QUESTION`: question to ask. Default is `What is IPEX-LLM?`.
- `-i INPUT_PATH`: path to the input doc.
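The GPU variant of [rag.py](./rag.py) differs from the CPU one mainly in device placement; a minimal sketch of the two loading calls (paths are placeholders):

```python
from langchain_community.embeddings import IpexLLMBgeEmbeddings
from langchain_community.llms import IpexLLM

# Run both the embedding model and the LLM on the Intel GPU via "device": "xpu".
embeddings = IpexLLMBgeEmbeddings(
    model_name="EMBEDDING_MODEL_PATH",  # placeholder
    model_kwargs={"device": "xpu"},
    encode_kwargs={"normalize_embeddings": True},
)
llm = IpexLLM.from_model_id(
    model_id="LLM_MODEL_PATH",  # placeholder
    model_kwargs={
        "temperature": 0,
        "max_length": 512,
        "trust_remote_code": True,
        "device": "xpu",
    },
)
# The rest of the chain (Chroma retriever, "rlm/rag-prompt", StrOutputParser)
# is identical to the CPU example.
```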
#### 4.3. Low Bit
### 4.3. Example: Low Bit
The low_bit example ([low_bit.py](./low_bit.py)) showcases how to use langchain with a low-bit optimized model.
The low_bit example ([low_bit.py](./low_bit.py)) showcases how to use LangChain with a low-bit optimized model.
By calling `save_low_bit`, we save the weights of the low-bit model into the target folder.
> Note: `save_low_bit` only saves the weights of the model.
> [!NOTE]
> `save_low_bit` only saves the weights of the model.
> Users could copy the tokenizer model into the target folder or specify `tokenizer_id` during initialization.
Install dependencies:
Install LangChain dependencies:
```bash
pip install langchain==0.0.184
pip install -U pandas==2.0.3
pip install -U langchain langchain-community
```
Then execute:
In the current directory, run the example with the command:
```bash
python low_bit.py -m <path_to_model> -t <path_to_target> [-q <your question>]
```
**Runtime Arguments Explained**:
**Additional Parameters for Configuration:**
- `-m MODEL_PATH`: **Required**, the path to the model
- `-t TARGET_PATH`: **Required**, the path to save the low_bit model
- `-q QUESTION`: the question
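As on CPU, the GPU low-bit example saves the weights once and then reloads them; the reload step looks roughly like this (paths are placeholders):

```python
from langchain_community.llms import IpexLLM

# Reload previously saved low-bit weights directly onto the Intel GPU.
llm_lowbit = IpexLLM.from_model_id_low_bit(
    model_id="TARGET_PATH",     # placeholder: folder written by save_low_bit
    tokenizer_id="MODEL_PATH",  # placeholder: original model folder (for the tokenizer)
    model_kwargs={
        "temperature": 0,
        "max_length": 64,
        "trust_remote_code": True,
        "device": "xpu",
    },
)
print(llm_lowbit.invoke("What is AI?"))
```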
#### 4.4 vLLM
The vLLM example ([vllm.py](./vllm.py)) showcases how to use LangChain with the ipex-llm integrated vLLM engine.
Install dependencies:
```bash
pip install "langchain<0.2"
```
Besides, you should also install the IPEX-LLM integrated vLLM according to the instructions listed [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#install-vllm)
**Runtime Arguments Explained**:
- `-m MODEL_PATH`: **Required**, the path to the model
- `-q QUESTION`: the question
- `-t MAX_TOKENS`: max tokens to generate, default 128
- `-p TENSOR_PARALLEL_SIZE`: Use multiple cards for generation
- `-l LOAD_IN_LOW_BIT`: Low bit format for quantization
##### Single card
The following command shows an example on how to execute the example using one card:
```bash
python ./vllm.py -m YOUR_MODEL_PATH -q "What is AI?" -t 128 -p 1 -l sym_int4
```
##### Multi cards
To use the `-p TENSOR_PARALLEL_SIZE` option, you will need to use our docker image: `intelanalytics/ipex-llm-serving-xpu:latest`. For how to use the image, check this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/vllm_docker_quickstart.html#multi-card-serving).
The following command shows an example on how to execute the example using two cards:
```bash
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
python ./vllm.py -m YOUR_MODEL_PATH -q "What is AI?" -t 128 -p 2 -l sym_int4
```
- `-q QUESTION`: question to ask. Default is `What is AI?`.


@ -20,10 +20,13 @@
# only search the first bigdl package and end up finding only one sub-package.
import argparse
import warnings
from ipex_llm.langchain.llms import TransformersLLM, TransformersPipelineLLM
from langchain import PromptTemplate, LLMChain
from langchain import HuggingFacePipeline
from langchain.chains import LLMChain
from langchain_community.llms import IpexLLM
from langchain_core.prompts import PromptTemplate
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
def main(args):
@ -38,22 +41,19 @@ def main(args):
prompt = PromptTemplate(template=template, input_variables=["question"])
# llm = TransformersPipelineLLM.from_model_id(
# model_id=model_path,
# task="text-generation",
# model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
# device_map='xpu'
# )
llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
device_map='xpu'
model_kwargs={
"temperature": 0,
"max_length": 64,
"trust_remote_code": True,
"device": "xpu",
},
)
llm_chain = LLMChain(prompt=prompt, llm=llm)
llm_chain = prompt | llm
output = llm_chain.run(question)
output = llm_chain.invoke(question)
print("====output=====")
print(output)


@ -16,11 +16,13 @@
import argparse
import warnings
from ipex_llm.langchain.llms import TransformersLLM, TransformersPipelineLLM
from langchain import PromptTemplate, LLMChain
from langchain import HuggingFacePipeline
from torch import device
from langchain.chains import LLMChain
from langchain_community.llms import IpexLLM
from langchain_core.prompts import PromptTemplate
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
def main(args):
@ -31,22 +33,31 @@ def main(args):
prompt = PromptTemplate(template=template, input_variables=["question"])
llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
device_map='xpu'
model_kwargs={
"temperature": 0,
"max_length": 64,
"trust_remote_code": True,
"device": "xpu",
},
)
llm.model.save_low_bit(low_bit_model_path)
del llm
low_bit_llm = TransformersLLM.from_model_id_low_bit(
llm_lowbit = IpexLLM.from_model_id_low_bit(
model_id=low_bit_model_path,
tokenizer_id=model_path,
device_map='xpu',
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True}
# tokenizer_name=saved_lowbit_model_path, # copy the tokenizers to saved path if you want to use it this way
model_kwargs={
"temperature": 0,
"max_length": 64,
"trust_remote_code": True,
"device": "xpu",
},
)
llm_chain = LLMChain(prompt=prompt, llm=low_bit_llm)
llm_chain = prompt | llm_lowbit
output = llm_chain.run(question)
output = llm_chain.invoke(question)
print("====output=====")
print(output)


@ -19,36 +19,31 @@
# Otherwise there would be module not found error in non-pip's setting as Python would
# only search the first bigdl package and end up finding only one sub-package.
# Code is adapted from https://python.langchain.com/docs/modules/chains/additional/question_answering.html
# Code is adapted from https://github.com/langchain-ai/langchain/blob/master/docs/docs/tutorials/rag.ipynb
import torch
import argparse
import warnings
from langchain.vectorstores import Chroma
from langchain.chains.chat_vector_db.prompts import (CONDENSE_QUESTION_PROMPT,
QA_PROMPT)
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks.manager import CallbackManager
from langchain import hub
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.embeddings import IpexLLMBgeEmbeddings
from langchain_community.llms import IpexLLM
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_chroma import Chroma
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
from ipex_llm.langchain.llms import TransformersLLM
from ipex_llm.langchain.embeddings import TransformersEmbeddings
text_doc = '''
BigDL seamlessly scales your data analytics & AI applications from laptop to cloud, with the following libraries:
LLM: Low-bit (INT3/INT4/INT5/INT8) large language model library for Intel CPU/GPU
Orca: Distributed Big Data & AI (TF & PyTorch) Pipeline on Spark and Ray
Nano: Transparent Acceleration of Tensorflow & PyTorch Programs on Intel CPU/GPU
DLlib: "Equivalent of Spark MLlib" for Deep Learning
Chronos: Scalable Time Series Analysis using AutoML
Friesian: End-to-End Recommendation Systems
PPML: Secure Big Data and AI (with SGX Hardware Security)
IPEX-LLM is an LLM acceleration library for Intel CPU, GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) and NPU. It is built on top of the excellent work of llama.cpp, transformers, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc. It provides seamless integration with llama.cpp, Ollama, HuggingFace transformers, LangChain, LlamaIndex, vLLM, Text-Generation-WebUI, DeepSpeed-AutoTP, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModelScope, etc. 70+ models have been optimized/verified on ipex-llm (e.g., Llama, Phi, Mistral, Mixtral, Whisper, Qwen, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art LLM optimizations, XPU acceleration and low-bit (FP8/FP6/FP4/INT4) support.
'''
def main(args):
input_path = args.input_path
model_path = args.model_path
embed_model_path = args.embed_model_path
query = args.question
# split texts of input doc
@ -62,37 +57,46 @@ def main(args):
texts = text_splitter.split_text(input_doc)
# create embeddings and store into vectordb
embeddings = TransformersEmbeddings.from_model_id(
model_id=model_path,
model_kwargs={"trust_remote_code": True},
device_map='xpu'
)
docsearch = Chroma.from_texts(texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]).as_retriever()
embeddings = IpexLLMBgeEmbeddings(
model_name=embed_model_path,
model_kwargs={"device": "xpu"},
encode_kwargs={"normalize_embeddings": True},
)
retriever = Chroma.from_texts(texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]).as_retriever()
# get relevant texts
docs = docsearch.get_relevant_documents(query)
bigdl_llm = TransformersLLM.from_model_id(
llm = IpexLLM.from_model_id(
model_id=model_path,
model_kwargs={"temperature": 0, "max_length": 1024, "trust_remote_code": True},
device_map='xpu'
model_kwargs={
"temperature": 0,
"max_length": 512,
"trust_remote_code": True,
"device": "xpu",
},
)
doc_chain = load_qa_chain(
bigdl_llm, chain_type="stuff", prompt=QA_PROMPT
)
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
prompt = hub.pull("rlm/rag-prompt")
output = doc_chain.run(input_documents=docs, question=query)
print(output)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
rag_chain.invoke(query)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='TransformersLLM Langchain QA over Docs Example')
parser.add_argument('-m','--model-path', type=str, required=True,
help='the path to transformers model')
parser.add_argument('-e','--embed-model-path', type=str, required=True,
help='the path to embedding model')
parser.add_argument('-i', '--input-path', type=str,
help='the path to the input doc.')
parser.add_argument('-q', '--question', type=str, default='What is BigDL?',
parser.add_argument('-q', '--question', type=str, default='What is IPEX-LLM?',
help='question you want to ask.')
args = parser.parse_args()


@ -1,45 +0,0 @@
from ipex_llm.langchain.vllm.vllm import VLLM
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate
import argparse
def main(args):
llm = VLLM(
model=args.model_path,
trust_remote_code=True, # mandatory for hf models
max_new_tokens=128,
top_k=10,
top_p=0.95,
temperature=0.8,
max_model_len=2048,
enforce_eager=True,
load_in_low_bit=args.load_in_low_bit,
device="xpu",
tensor_parallel_size=args.tensor_parallel_size,
)
print(llm.invoke(args.question))
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)
llm_chain = LLMChain(prompt=prompt, llm=llm)
print(llm_chain.invoke("Who was the US president in the year the first Pokemon game was released?"))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Langchain integrated vLLM example')
parser.add_argument('-m','--model-path', type=str, required=True,
help='the path to transformers model')
parser.add_argument('-q', '--question', type=str, default='What is the capital of France?', help='question you want to ask.')
parser.add_argument('-t', '--max-tokens', type=int, default=128, help='max tokens to generate')
parser.add_argument('-p', '--tensor-parallel-size', type=int, default=1, help="vLLM tensor parallel size")
parser.add_argument('-l', '--load-in-low-bit', type=str, default='sym_int4', help="low bit format")
args = parser.parse_args()
main(args)