Update serving doc (#10475)

* update serving doc

* add toc

* update

* update

* update

* update vllm worker
ZehuaCao 2024-03-20 14:44:43 +08:00 committed by GitHub

# Serving using BigDL-LLM and FastChat
FastChat is an open platform for training, serving, and evaluating large language model based chatbots. You can find the detailed information at their [homepage](https://github.com/lm-sys/FastChat).
BigDL-LLM can be easily integrated into FastChat so that users can use `BigDL-LLM` as a serving backend in their deployments.
<details>
<summary>Table of contents</summary>
- [Install](#install)
- [Start the service](#start-the-service)
  - [Launch controller](#launch-controller)
  - [Launch model worker(s) and load models](#launch-model-workers-and-load-models)
    - [BigDL model worker](#bigdl-model-worker)
    - [BigDL vLLM model worker](#bigdl-vllm-model-worker)
  - [Launch Gradio web server](#launch-gradio-web-server)
  - [Launch RESTful API server](#launch-restful-api-server)
</details>
## Install
You may install **`bigdl-llm`** with `FastChat` as follows:
```bash
pip install --pre --upgrade bigdl-llm[all]
```
To add GPU support for FastChat, you may install **`bigdl-llm`** as follows:
```bash
pip install --pre --upgrade bigdl-llm[xpu,serving] -f https://developer.intel.com/ipex-whl-stable-xpu
```
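As a quick sanity check (this step is not part of the original instructions), you can confirm that the package imports correctly before moving on:

```bash
# Verify the installation by importing BigDL-LLM's transformers API
python3 -c "from bigdl.llm.transformers import AutoModelForCausalLM; print('bigdl-llm is installed')"
```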
## Start the service
### Launch controller
You first need to run the FastChat controller:
```bash
python3 -m fastchat.serve.controller
```
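By default the controller listens on port 21001. If it needs to be reachable from other machines, you can pass the host and port explicitly; the flags below are FastChat's standard options and the values are only an example:

```bash
# Bind the controller to all interfaces on the default port
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001
```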
### Launch model worker(s) and load models
#### BigDL model worker
Using BigDL-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformer models can be utilized in FastChat.
FastChat determines the model adapter to use through path matching. Therefore, in order to load models using BigDL-LLM, you need to make some modifications to the model's name.
For instance, assume you have downloaded `llama-7b-hf` from [HuggingFace](https://huggingface.co/decapoda-research/llama-7b-hf). To use `BigDL-LLM` as the backend, you need to change the name from `llama-7b-hf` to `bigdl-7b`. The key point here is that the model's path should include "bigdl" and **should not include paths matched by other model adapters**.
Then we will use `bigdl-7b` as the model path.
> note: This is caused by the priority of the name-matching list. The newly added `BigDL-LLM` adapter is at the tail of the list, so it has the lowest priority. If the model path contains another keyword such as `vicuna`, which matches an adapter with higher priority, the `BigDL-LLM` adapter will not be used.

A special case is `ChatGLM` models. For these models, you do not need to make any changes after downloading the model, and the `BigDL-LLM` backend will be used automatically.
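For example, one simple way to satisfy this naming rule is to rename, or symlink, the downloaded checkpoint folder so that its name contains "bigdl". The paths below are placeholders; point them at wherever you stored the model:

```bash
# Placeholder paths -- adjust to your local model directory
mv ./llama-7b-hf ./bigdl-7b

# Or keep the original folder and create a symlink whose name contains "bigdl"
ln -s /models/llama-7b-hf /models/bigdl-7b
```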
Then we can run the model workers:
```bash
# On CPU
python3 -m bigdl.llm.serving.model_worker --model-path PATH/TO/bigdl-7b --device cpu
# On GPU
python3 -m bigdl.llm.serving.model_worker --model-path PATH/TO/bigdl-7b --device xpu
```
If you run successfully using the `BigDL` backend, you will see output in the log like this:
```bash
INFO - Converting the current model to sym_int4 format......
```
Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself to the controller.
> To run the model worker using an Intel GPU, simply change the `--device cpu` option to `--device xpu`.
> note: We currently only support int4 quantization.
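To quickly check that the worker has registered with the controller, you can optionally use FastChat's built-in test client. The model name below assumes your model folder was named `bigdl-7b`; adjust it to whatever name your worker reports:

```bash
# Send a test prompt through the controller to the registered worker
python3 -m fastchat.serve.test_message --model-name bigdl-7b
```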
#### BigDL vLLM model worker
We also provide the `vllm_worker`, which uses the [vLLM](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/vLLM-Serving) engine for better hardware utilization.
To run using the `vllm_worker`, we don't need to change the model name; simply use the following command:
```bash
# On CPU
python3 -m bigdl.llm.serving.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device cpu
# On GPU
python3 -m bigdl.llm.serving.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu
```
### Launch Gradio web server
```bash
python3 -m fastchat.serve.gradio_web_server
```
This is the user interface that users will interact with.
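If you need the web UI to listen on a specific address or port, the Gradio server accepts host and port options; the values below are illustrative, so check your FastChat version for the exact defaults:

```bash
# Expose the web UI on all interfaces, port 7860
python3 -m fastchat.serve.gradio_web_server --host 0.0.0.0 --port 7860
```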
By following these steps, you will be able to serve your models using the web UI with BigDL-LLM as the backend. You can open your browser and chat with a model now.
### Launch RESTful API server
To start an OpenAI API server that provides compatible APIs using the BigDL-LLM backend, you can launch the `openai_api_server` and follow this [doc](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) to use it.
```bash
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
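Once the API server is running, you can exercise the OpenAI-compatible endpoints with plain `curl`. The model name below is a placeholder and must match the name under which your worker registered with the controller:

```bash
# List the models known to the server
curl http://localhost:8000/v1/models

# Send a chat completion request (replace bigdl-7b with your registered model name)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "bigdl-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```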