
Serving using IPEX-LLM and FastChat

FastChat is an open platform for training, serving, and evaluating large language model-based chatbots. You can find detailed information at their homepage.

IPEX-LLM can be easily integrated into FastChat so that users can use IPEX-LLM as a serving backend in their deployments.

Install

You may install ipex-llm with FastChat as follows:

pip install --pre --upgrade ipex-llm[serving]

# Or
pip install --pre --upgrade ipex-llm[all]

To add GPU support for FastChat, you may install ipex-llm as follows:

pip install --pre --upgrade ipex-llm[xpu,serving] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

Start the service

Launch controller

You first need to run the FastChat controller:

python3 -m fastchat.serve.controller
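
To verify the controller is up before registering workers, you can query it over HTTP. The sketch below is a minimal example using only the Python standard library; it assumes the controller's default port 21001 and its /list_models endpoint:

```python
import json
import urllib.request

def build_list_models_request(base_url="http://localhost:21001"):
    """Build a POST request for the controller's /list_models endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/list_models", data=b"", method="POST"
    )

req = build_list_models_request()
# With the controller running, send it to list the registered workers:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["models"])
```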

Launch model worker(s) and load models

Using IPEX-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformer models can be utilized in FastChat.

IPEX-LLM model worker (deprecated)

Warning: This method has been deprecated, please use the IPEX-LLM worker instead.

FastChat determines which model adapter to use through path matching. Therefore, to load models using IPEX-LLM, you need to make some modifications to the model's name.

For instance, assume you have downloaded llama-7b-hf from Hugging Face. To use IPEX-LLM as the backend, you need to rename llama-7b-hf to ipex-llm-7b. The key point is that the model's path should include "ipex" and should not include paths matched by other model adapters.

Then we will use ipex-llm-7b as the model path.

Note: This is caused by the priority of the name-matching list. The newly added IPEX-LLM adapter is at the tail of the name-matching list, so it has the lowest priority. If the model path contains another keyword such as vicuna, which matches an adapter with higher priority, the IPEX-LLM adapter will not be used.
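
The priority rule above can be illustrated with a simplified sketch. This is an assumption about how first-match-wins keyword lookup behaves, not FastChat's actual adapter code, and the keyword list here is hypothetical:

```python
# Earlier entries win; "ipex" sits at the tail, so it has the lowest priority.
ADAPTER_KEYWORDS = ["vicuna", "chatglm", "ipex"]

def match_adapter(model_path: str) -> str:
    """Return the first adapter keyword found in the model path."""
    path = model_path.lower()
    for keyword in ADAPTER_KEYWORDS:
        if keyword in path:
            return keyword
    return "default"

print(match_adapter("PATH/TO/ipex-llm-7b"))     # -> ipex
print(match_adapter("PATH/TO/vicuna-ipex-7b"))  # -> vicuna (higher priority wins)
```

This is why a path containing both vicuna and ipex would not be handled by the IPEX-LLM adapter.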

A special case is the ChatGLM models. For these, you do not need to make any changes after downloading the model; the IPEX-LLM backend will be used automatically.

Then we can run the model workers:

# On CPU
python3 -m ipex_llm.serving.fastchat.model_worker --model-path PATH/TO/ipex-llm-7b --device cpu

# On GPU
python3 -m ipex_llm.serving.fastchat.model_worker --model-path PATH/TO/ipex-llm-7b --device xpu

If the worker starts successfully with the ipex_llm backend, you will see a log line like this:

INFO - Converting the current model to sym_int4 format......

Note: We currently only support int4 quantization for this method.
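
The sym_int4 format mentioned in the log can be illustrated with a small round-trip. This is a conceptual sketch of symmetric 4-bit quantization, not IPEX-LLM's actual kernels:

```python
def sym_int4_quantize(weights):
    """Map floats to integers in the symmetric int4 range [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def sym_int4_dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [v * scale for v in q]

weights = [0.2, -0.7, 0.3, 0.5]
q, scale = sym_int4_quantize(weights)   # q -> [2, -7, 3, 5]
restored = sym_int4_dequantize(q, scale)  # close to the original weights
```

Each weight is stored in 4 bits plus a shared scale, which is where the memory savings come from.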

IPEX-LLM worker

To integrate IPEX-LLM with FastChat efficiently, we provide a new model_worker implementation named ipex_llm_worker.py.

To run the ipex_llm_worker on CPU, use the following commands:

source ipex-llm-init -t

# Available low-bit formats include sym_int4, sym_int8, bf16, etc.
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "sym_int4" --trust-remote-code --device "cpu"

For GPU example:

# Available low-bit formats include sym_int4, sym_int8, fp16, etc.
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "sym_int4" --trust-remote-code --device "xpu"

For a full list of accepted arguments, refer to the main method of ipex_llm_worker.py.

IPEX-LLM vLLM worker

We also provide the vllm_worker which uses the vLLM engine for better hardware utilization.

To run using the vllm_worker, you don't need to change the model name; simply use the following commands:

# On CPU
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device cpu

# On GPU
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu

Launch Gradio web server

python3 -m fastchat.serve.gradio_web_server

This is the user interface that users will interact with.

By following these steps, you will be able to serve your models using the web UI with IPEX-LLM as the backend. You can open your browser and chat with a model now.

Launch RESTful API server

To start an OpenAI API server that provides compatible APIs using the IPEX-LLM backend, launch the openai_api_server and follow this doc to use it.

python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
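
Once the API server is running, any OpenAI-compatible client can talk to it. The sketch below builds a chat completion request using only the Python standard library; the host and port match the command above, and the model name is a placeholder for whichever worker you registered:

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt):
    """Build an OpenAI-compatible /v1/chat/completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8000", "vicuna-7b-v1.5", "Hello!")
# With the server running, send it and read the reply:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```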