Update serving doc (#10475)

* update serving doc

* add toc

* update

* update

* update

* update vllm worker
ZehuaCao 2024-03-20 14:44:43 +08:00 committed by GitHub

# Serving using BigDL-LLM and FastChat
FastChat is an open platform for training, serving, and evaluating large language model based chatbots. You can find the detailed information at their [homepage](https://github.com/lm-sys/FastChat).
BigDL-LLM can be easily integrated into FastChat so that users can use `BigDL-LLM` as a serving backend in their deployments.
<details>
<summary>Table of contents</summary>
- [Install](#install)
- [Start the service](#start-the-service)
  - [Launch controller](#launch-controller)
  - [Launch model worker(s) and load models](#launch-model-workers-and-load-models)
    - [BigDL model worker](#bigdl-model-worker)
    - [BigDL vLLM model worker](#bigdl-vllm-model-worker)
  - [Launch Gradio web server](#launch-gradio-web-server)
  - [Launch RESTful API server](#launch-restful-api-server)
</details>
## Install
You may install **`bigdl-llm`** with `FastChat` as follows:
```bash
pip install --pre --upgrade bigdl-llm[all]
```
To add GPU support for FastChat, you may install **`bigdl-llm`** as follows:
```bash
pip install --pre --upgrade bigdl-llm[xpu,serving] -f https://developer.intel.com/ipex-whl-stable-xpu
```
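As a quick sanity check (this step is not part of the original instructions), you can confirm that the package imports correctly before moving on:

```bash
# Verify the installation by importing BigDL-LLM's transformers API
python3 -c "from bigdl.llm.transformers import AutoModelForCausalLM; print('bigdl-llm is installed')"
```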
## Start the service
### Launch controller
You first need to run the FastChat controller:
```bash
python3 -m fastchat.serve.controller
```
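By default the controller listens on port 21001. If it needs to be reachable from other machines, you can pass the host and port explicitly; the flags below are FastChat's standard options and the values are only an example:

```bash
# Bind the controller to all interfaces on the default port
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001
```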
### Launch model worker(s) and load models
#### BigDL model worker
Using BigDL-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformer models can be utilized in FastChat.
FastChat determines the model adapter to use through path matching. Therefore, in order to load models using BigDL-LLM, you need to make some modifications to the model's name.
For instance, assume you have downloaded `llama-7b-hf` from [HuggingFace](https://huggingface.co/decapoda-research/llama-7b-hf). To use `BigDL-LLM` as the backend, you need to change the name from `llama-7b-hf` to `bigdl-7b`. The key point here is that the model's path should include "bigdl" and **should not include paths matched by other model adapters**.
Then we will use `bigdl-7b` as the model path.
> note: This is caused by the priority of the name-matching list. The newly added `BigDL-LLM` adapter is at the tail of the list, so it has the lowest priority. If the model path contains another keyword such as `vicuna`, which matches an adapter with higher priority, the `BigDL-LLM` adapter will not be used.

A special case is `ChatGLM` models. For these models, you do not need to make any changes after downloading the model, and the `BigDL-LLM` backend will be used automatically.
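For example, one simple way to satisfy this naming rule is to rename, or symlink, the downloaded checkpoint folder so that its name contains "bigdl". The paths below are placeholders; point them at wherever you stored the model:

```bash
# Placeholder paths -- adjust to your local model directory
mv ./llama-7b-hf ./bigdl-7b

# Or keep the original folder and create a symlink whose name contains "bigdl"
ln -s /models/llama-7b-hf /models/bigdl-7b
```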
Then we can run the model workers:
```bash
# On CPU
python3 -m bigdl.llm.serving.model_worker --model-path PATH/TO/bigdl-7b --device cpu
# On GPU
python3 -m bigdl.llm.serving.model_worker --model-path PATH/TO/bigdl-7b --device xpu
```
If you run successfully using the `BigDL` backend, you will see output in the log like this:
```bash
INFO - Converting the current model to sym_int4 format......
```
Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself to the controller.
> To run the model worker using an Intel GPU, simply change the `--device cpu` option to `--device xpu`.
> note: We currently only support int4 quantization.
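To quickly check that the worker has registered with the controller, you can optionally use FastChat's built-in test client. The model name below assumes your model folder was named `bigdl-7b`; adjust it to whatever name your worker reports:

```bash
# Send a test prompt through the controller to the registered worker
python3 -m fastchat.serve.test_message --model-name bigdl-7b
```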
#### BigDL vLLM model worker
We also provide the `vllm_worker`, which uses the [vLLM](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/vLLM-Serving) engine for better hardware utilization.
To run using the `vllm_worker`, we don't need to change the model name; simply use the following command:
```bash
# On CPU
python3 -m bigdl.llm.serving.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device cpu
# On GPU
python3 -m bigdl.llm.serving.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu
```
### Launch Gradio web server
```bash
python3 -m fastchat.serve.gradio_web_server
```
This is the user interface that users will interact with.
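If you need the web UI to listen on a specific address or port, the Gradio server accepts host and port options; the values below are illustrative, so check your FastChat version for the exact defaults:

```bash
# Expose the web UI on all interfaces, port 7860
python3 -m fastchat.serve.gradio_web_server --host 0.0.0.0 --port 7860
```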
By following these steps, you will be able to serve your models using the web UI with BigDL-LLM as the backend. You can open your browser and chat with a model now.
### Launch RESTful API server
To start an OpenAI API server that provides compatible APIs using the BigDL-LLM backend, you can launch the `openai_api_server` and follow this [doc](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) to use it.
```bash
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
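Once the API server is running, you can exercise the OpenAI-compatible endpoints with plain `curl`. The model name below is a placeholder and must match the name under which your worker registered with the controller:

```bash
# List the models known to the server
curl http://localhost:8000/v1/models

# Send a chat completion request (replace bigdl-7b with your registered model name)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "bigdl-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```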