Serving using BigDL-LLM and FastChat
FastChat is an open platform for training, serving, and evaluating large language model-based chatbots. You can find detailed information on their homepage.
BigDL-LLM can be easily integrated into FastChat so that users can use BigDL-LLM as the serving backend in their deployments.
Working with BigDL-LLM Serving
Table of Contents
- Install
- Models
- Start the service
Install
You may install bigdl-llm with FastChat as follows:
pip install --pre --upgrade bigdl-llm[serving]
# Or
pip install --pre --upgrade bigdl-llm[all]
To add GPU support for FastChat, you may install bigdl-llm as follows:
pip install --pre --upgrade bigdl-llm[xpu, serving] -f https://developer.intel.com/ipex-whl-stable-xpu
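As a quick sanity check (not part of the official instructions), you can verify that the BigDL-LLM transformers API imports correctly after installation:

```bash
# Should print "bigdl-llm OK" if bigdl-llm is installed correctly
python3 -c "from bigdl.llm.transformers import AutoModelForCausalLM; print('bigdl-llm OK')"
```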
Models
Using BigDL-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformers models can be utilized in FastChat.
FastChat determines which model adapter to use through path matching. Therefore, in order to load models using BigDL-LLM, you need to make some modifications to the model's name.
For instance, suppose you have downloaded llama-7b-hf from Hugging Face. To use BigDL-LLM as the backend, you need to rename the model directory from llama-7b-hf to bigdl-7b, as shown in the sketch after the note below.
The key point here is that the model's path should include "bigdl" and should not include keywords matched by other model adapters.
Note: This is caused by the priority of the name-matching list. The newly added BigDL-LLM adapter is at the tail of the name-matching list, so it has the lowest priority. If the model path contains other keywords, such as vicuna, that match an adapter with higher priority, the BigDL-LLM adapter will not be used.
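A minimal sketch of the renaming step, assuming the model was downloaded to ./llama-7b-hf in the current directory (the exact paths here are an assumption for illustration):

```bash
# Any name containing "bigdl" works, as long as it does not match
# another FastChat adapter with higher priority (e.g. "vicuna").
mv ./llama-7b-hf ./bigdl-7b

# Then point the worker at the renamed directory, e.g.:
# python3 -m bigdl.llm.serving.model_worker --model-path ./bigdl-7b --device cpu
```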
A special case is ChatGLM models. For these models, you do not need to make any changes after downloading them; the BigDL-LLM backend will be used automatically.
Start the service
Serving with WebGUI
To serve using the Web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller to coordinate the web server and model workers.
Launch the Controller
python3 -m fastchat.serve.controller
This controller manages the distributed workers.
Launch the model worker(s)
python3 -m bigdl.llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device cpu
Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself to the controller.
To run the model worker on an Intel GPU, simply change the --device cpu option to --device xpu.
We also provide the vllm_worker, which uses the vLLM engine for better hardware utilization.
To use the vllm_worker, simply run the following command:
python3 -m bigdl.llm.serving.vllm_worker --model-path meta-llama/Llama-2-7b-chat-hf --device cpu/xpu # based on your device
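Once a worker has registered with the controller, you can optionally send a quick test message through FastChat before starting the web server. The model name below is an assumption based on the vicuna-7b-v1.3 path used in the example above:

```bash
# Sends a short prompt to the worker via the controller and prints the reply
python3 -m fastchat.serve.test_message --model-name vicuna-7b-v1.3
```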
Launch the Gradio web server
python3 -m fastchat.serve.gradio_web_server
This is the user interface that users will interact with.
By following these steps, you will be able to serve your models using the web UI with BigDL-LLM as the backend. You can open your browser and chat with a model now.
Serving with OpenAI-Compatible RESTful APIs
To start an OpenAI API server that provides compatible APIs using the BigDL-LLM backend, you need three main components: an OpenAI API server that serves incoming requests, model workers that host one or more models, and a controller to coordinate the web server and model workers.
First, launch the controller:
python3 -m fastchat.serve.controller
Then, launch the model worker(s):
python3 -m bigdl.llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device cpu
Finally, launch the RESTful API server:
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
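With all three components running, you can send requests to the OpenAI-compatible endpoint. Below is a minimal sketch using curl, assuming the vicuna-7b-v1.3 worker from above and the default localhost:8000 address (the model name is an assumption based on the model path):

```bash
# Ask the served model a question through the OpenAI-compatible chat API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "vicuna-7b-v1.3",
        "messages": [{"role": "user", "content": "Hello! What can you do?"}]
      }'
```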