Serving using BigDL-LLM and FastChat
FastChat is an open platform for training, serving, and evaluating large language model-based chatbots. You can find detailed information on their homepage.
BigDL-LLM can be easily integrated into FastChat so that users can use BigDL-LLM as the serving backend in their deployments.
Working with BigDL-LLM Serving
Table of Contents
- Install
- Models
- Start the service
Install
You may install bigdl-llm with FastChat as follows:
pip install --pre --upgrade bigdl-llm[serving]
# Or
pip install --pre --upgrade bigdl-llm[all]
To add GPU support for FastChat, you may install bigdl-llm as follows:
pip install --pre --upgrade bigdl-llm[xpu, serving] -f https://developer.intel.com/ipex-whl-stable-xpu
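As a quick sanity check (not part of the official instructions), you can verify that the BigDL-LLM transformers API imports correctly after installation:

```bash
# Should print "bigdl-llm OK" if bigdl-llm is installed correctly
python3 -c "from bigdl.llm.transformers import AutoModelForCausalLM; print('bigdl-llm OK')"
```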
Models
Using BigDL-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformers models can be utilized in FastChat.
FastChat determines which model adapter to use through path matching. Therefore, in order to load models using BigDL-LLM, you need to make some modifications to the model's name.
For instance, suppose you have downloaded llama-7b-hf from Hugging Face. To use BigDL-LLM as the backend, you need to rename the model directory from llama-7b-hf to bigdl-7b, as shown in the sketch after the note below.
The key point here is that the model's path should include "bigdl" and should not include keywords matched by other model adapters.
Note: This is caused by the priority of the name-matching list. The newly added BigDL-LLM adapter is at the tail of the name-matching list, so it has the lowest priority. If the model path contains other keywords, such as vicuna, that match an adapter with higher priority, the BigDL-LLM adapter will not be used.
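A minimal sketch of the renaming step, assuming the model was downloaded to ./llama-7b-hf in the current directory (the exact paths here are an assumption for illustration):

```bash
# Any name containing "bigdl" works, as long as it does not match
# another FastChat adapter with higher priority (e.g. "vicuna").
mv ./llama-7b-hf ./bigdl-7b

# Then point the worker at the renamed directory, e.g.:
# python3 -m bigdl.llm.serving.model_worker --model-path ./bigdl-7b --device cpu
```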
A special case is ChatGLM models. For these models, you do not need to make any changes after downloading them; the BigDL-LLM backend will be used automatically.
Start the service
Serving with WebGUI
To serve using the Web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller to coordinate the web server and model workers.
Launch the Controller
python3 -m fastchat.serve.controller
This controller manages the distributed workers.
Launch the model worker(s)
python3 -m bigdl.llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device cpu
Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself to the controller.
To run the model worker on an Intel GPU, simply change the --device cpu option to --device xpu.
We also provide the vllm_worker, which uses the vLLM engine for better hardware utilization.
To use the vllm_worker, simply run the following command:
python3 -m bigdl.llm.serving.vllm_worker --model-path meta-llama/Llama-2-7b-chat-hf --device cpu/xpu # based on your device
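Once a worker has registered with the controller, you can optionally send a quick test message through FastChat before starting the web server. The model name below is an assumption based on the vicuna-7b-v1.3 path used in the example above:

```bash
# Sends a short prompt to the worker via the controller and prints the reply
python3 -m fastchat.serve.test_message --model-name vicuna-7b-v1.3
```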
Launch the Gradio web server
python3 -m fastchat.serve.gradio_web_server
This is the user interface that users will interact with.
By following these steps, you will be able to serve your models using the web UI with BigDL-LLM as the backend. You can open your browser and chat with a model now.
Serving with OpenAI-Compatible RESTful APIs
To start an OpenAI API server that provides compatible APIs using the BigDL-LLM backend, you need three main components: an OpenAI API server that serves incoming requests, model workers that host one or more models, and a controller to coordinate the web server and model workers.
First, launch the controller:
python3 -m fastchat.serve.controller
Then, launch the model worker(s):
python3 -m bigdl.llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device cpu
Finally, launch the RESTful API server:
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
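With all three components running, you can send requests to the OpenAI-compatible endpoint. Below is a minimal sketch using curl, assuming the vicuna-7b-v1.3 worker from above and the default localhost:8000 address (the model name is an assumption based on the model path):

```bash
# Ask the served model a question through the OpenAI-compatible chat API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "vicuna-7b-v1.3",
        "messages": [{"role": "user", "content": "Hello! What can you do?"}]
      }'
```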