Update serving doc (#10475)
* update serving doc
* add tob
* update
* update
* update
* update vllm worker
# Serving using BigDL-LLM and FastChat
FastChat is an open platform for training, serving, and evaluating large language model based chatbots. You can find the detailed information at their [homepage](https://github.com/lm-sys/FastChat).

BigDL-LLM can be easily integrated into FastChat so that users can use `BigDL-LLM` as a serving backend in their deployments.

<details>
<summary>Table of contents</summary>

- [Install](#install)
- [Start the service](#start-the-service)
- [Launch controller](#launch-controller)
- [Launch model worker(s) and load models](#launch-model-workers-and-load-models)
- [BigDL model worker](#bigdl-model-worker)
- [BigDL vLLM model worker](#vllm-model-worker)
- [Launch Gradio web server](#launch-gradio-web-server)
- [Launch RESTful API server](#launch-restful-api-server)

</details>

## Install

You may install **`bigdl-llm`** with `FastChat` as follows:

```bash
pip install --pre --upgrade bigdl-llm[all]
```

To add GPU support for FastChat, you may install **`bigdl-llm`** as follows:

```bash
pip install --pre --upgrade bigdl-llm[xpu,serving] -f https://developer.intel.com/ipex-whl-stable-xpu
```
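
A quick way to sanity-check the installation (a minimal sketch; the command below simply verifies that the `bigdl.llm` package is importable in the current environment):

```bash
# Should exit without errors if bigdl-llm is installed correctly.
python3 -c "import bigdl.llm"
```
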
## Start the service

To serve using the Web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller to coordinate the web server and model workers.

### Launch controller

You first need to run the FastChat controller:

```bash
python3 -m fastchat.serve.controller
```

This controller manages the distributed workers.

### Launch model worker(s) and load models

#### BigDL model worker

Using BigDL-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformer models can be utilized in FastChat.

FastChat determines the model adapter to use through path matching. Therefore, in order to load models using BigDL-LLM, you need to make some modifications to the model's name.

For instance, assume you have downloaded `llama-7b-hf` from [HuggingFace](https://huggingface.co/decapoda-research/llama-7b-hf). To use `BigDL-LLM` as the backend, you need to change the name from `llama-7b-hf` to `bigdl-7b`. The key point here is that the model's path should include "bigdl" and **should not include paths matched by other model adapters**.

Then we will use `bigdl-7b` as the model path.

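For illustration, a minimal sketch of that renaming step, assuming the weights were downloaded to a local folder named `llama-7b-hf` (the paths here are hypothetical):

```bash
# Rename (or symlink) the folder so its path contains "bigdl",
# which lets FastChat's path matching select the BigDL-LLM adapter.
mv ./llama-7b-hf ./bigdl-7b
# Alternatively, keep the original folder and expose it under a matching name:
# ln -s "$(pwd)/llama-7b-hf" "$(pwd)/bigdl-7b"
```
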
> note: This is caused by the priority of the name-matching list. The newly added `BigDL-LLM` adapter is at the tail of the name-matching list, so it has the lowest priority. If the model path contains other keywords like `vicuna` that match an adapter with a higher priority, the `BigDL-LLM` adapter will not be used.

A special case is `ChatGLM` models. For these models, you do not need to make any changes after downloading the model, and the `BigDL-LLM` backend will be used automatically.

Then we can run the model workers:

```bash
# On CPU
python3 -m bigdl.llm.serving.model_worker --model-path PATH/TO/bigdl-7b --device cpu

# On GPU
python3 -m bigdl.llm.serving.model_worker --model-path PATH/TO/bigdl-7b --device xpu
```

If the `BigDL` backend is used successfully, you will see output in the log like this:

```bash
INFO - Converting the current model to sym_int4 format......
```

Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself with the controller.

> note: We currently only support int4 quantization.

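As a quick sanity check that the worker has registered, you can (for example) route a single test prompt through FastChat's `test_message` utility; the model name below is a hypothetical example and must match the name the worker registered under:

```bash
# Routes one test message through the controller to the worker
# registered as "bigdl-7b" (replace with your model's registered name).
python3 -m fastchat.serve.test_message --model-name bigdl-7b
```
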
#### BigDL vLLM model worker

We also provide the `vllm_worker`, which uses the [vLLM](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/vLLM-Serving) engine for better hardware utilization.

To run using the `vllm_worker`, you do not need to change the model name; simply use the following command:

```bash
# On CPU
python3 -m bigdl.llm.serving.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device cpu

# On GPU
python3 -m bigdl.llm.serving.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu
```

### Launch Gradio web server

```bash
python3 -m fastchat.serve.gradio_web_server
```

This is the user interface that users will interact with.

By following these steps, you will be able to serve your models using the web UI with BigDL-LLM as the backend. You can open your browser and chat with a model now.

### Launch RESTful API server

To start an OpenAI API server that provides compatible APIs using the BigDL-LLM backend, you can launch the `openai_api_server` and follow this [doc](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) to use it.

```bash
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
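
Once the server is up, a minimal sketch of calling the OpenAI-compatible endpoint with `curl` (the model name `bigdl-7b` is a hypothetical example and must match the name your worker registered under):

```bash
# Send a chat completion request to the OpenAI-compatible API started above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "bigdl-7b",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```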