Add fastchat quickstart (#10688)
* add fastchat quickstart
* update
* update
* update
parent ea5e46c8cb
commit a7c12020b4
4 changed files with 244 additions and 0 deletions

@@ -49,6 +49,9 @@
<li>
<a href="doc/LLM/Quickstart/ollama_quickstart.html">Run Ollama with IPEX-LLM on Intel GPU</a>
</li>
<li>
<a href="doc/LLM/Quickstart/fastchat_quickstart.html">Run IPEX-LLM Serving with FastChat</a>
</li>
</ul>
</li>
<li>

@@ -30,6 +30,7 @@ subtrees:
- file: doc/LLM/Quickstart/benchmark_quickstart
- file: doc/LLM/Quickstart/llama_cpp_quickstart
- file: doc/LLM/Quickstart/ollama_quickstart
- file: doc/LLM/Quickstart/fastchat_quickstart
- file: doc/LLM/Overview/KeyFeatures/index
title: "Key Features"
subtrees:

@@ -0,0 +1,239 @@
# Serving using IPEX-LLM and FastChat

FastChat is an open platform for training, serving, and evaluating chatbots based on large language models. You can find more details on its [homepage](https://github.com/lm-sys/FastChat).

IPEX-LLM can be easily integrated into FastChat so that users can use `IPEX-LLM` as a serving backend in their deployments.

## Quick Start

This quickstart guide walks you through installing and running `FastChat` with `ipex-llm`.

## 1. Install IPEX-LLM with FastChat

To run on CPU, you can install ipex-llm as follows:

```bash
pip install --pre --upgrade ipex-llm[serving,all]
```

To add GPU support for FastChat, you may install **`ipex-llm`** as follows:

```bash
pip install --pre --upgrade ipex-llm[xpu,serving] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
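
After installation, you can optionally confirm that both packages are importable before moving on. This is only a minimal sanity check and assumes the default import names `ipex_llm` and `fastchat`:

```bash
# Optional sanity check: import names are assumed defaults for ipex-llm and FastChat
python3 -c "import ipex_llm, fastchat; print('ipex-llm and fastchat are installed')"
```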

## 2. Start the service

### Launch controller

You first need to run the FastChat controller:

```bash
python3 -m fastchat.serve.controller
```

If the controller runs successfully, you will see output like this:

```bash
Uvicorn running on http://localhost:21001
```

### Launch model worker(s) and load models

Using IPEX-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformers models can be used in FastChat.

#### IPEX-LLM worker

To integrate IPEX-LLM with `FastChat` efficiently, we provide a new model_worker implementation named `ipex_llm_worker.py`.

```bash
# On CPU
# Available low_bit formats include sym_int4, sym_int8, bf16, etc.
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "cpu"

# On GPU
# Available low_bit formats include sym_int4, sym_int8, fp16, etc.
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu"
```

You should see output like this:

```bash
2024-04-12 18:18:09 | INFO | ipex_llm.transformers.utils | Converting the current model to sym_int4 format......
2024-04-12 18:18:11 | INFO | model_worker | Register to controller
2024-04-12 18:18:11 | ERROR | stderr | INFO: Started server process [126133]
2024-04-12 18:18:11 | ERROR | stderr | INFO: Waiting for application startup.
2024-04-12 18:18:11 | ERROR | stderr | INFO: Application startup complete.
2024-04-12 18:18:11 | ERROR | stderr | INFO: Uvicorn running on http://localhost:21002
```

For a full list of accepted arguments, refer to the main method of `ipex_llm_worker.py`.
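
As a quick alternative, printing the worker's command-line help should show the same argument list, assuming the worker exposes a standard argparse-style CLI:

```bash
# Print the accepted arguments (assumes an argparse-style CLI)
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --help
```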

#### IPEX-LLM vLLM worker

We also provide the `vllm_worker`, which uses the [vLLM](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/vLLM-Serving) engine for better hardware utilization.

To run using the `vllm_worker`, you don't need to change the model name; simply use the following commands:

```bash
# On CPU
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device cpu

# On GPU
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu
```

### Launch Gradio web server

Once you have started the controller and a worker, you can start the web server as follows:

```bash
python3 -m fastchat.serve.gradio_web_server
```
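
If you need to serve the UI on a specific address or port, FastChat's web server accepts host and port options. The flags below are an assumption to verify against your installed FastChat version (for example via `--help`):

```bash
# Bind the web UI to all interfaces on port 7860 (flag names assumed; verify with --help)
python3 -m fastchat.serve.gradio_web_server --host 0.0.0.0 --port 7860
```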

This is the user interface that users will interact with.

<a href="https://llm-assets.readthedocs.io/en/latest/_images/fastchat_gradio_web_ui.png" target="_blank">
  <img src="https://llm-assets.readthedocs.io/en/latest/_images/fastchat_gradio_web_ui.png" width="100%" />
</a>

By following these steps, you will be able to serve your models through the web UI with IPEX-LLM as the backend. You can open your browser and chat with a model now.

### Launch RESTful API server

To start an OpenAI-compatible API server with the IPEX-LLM backend, launch the `openai_api_server` and follow this [doc](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) to use it.

Once you have started the controller and a worker, you can start the RESTful API server as follows:

```bash
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```

You can use `curl` to inspect the output of the API and `jq` to format it.

#### List Models

```bash
curl http://localhost:8000/v1/models | jq
```

Example output:

```json
{
  "object": "list",
  "data": [
    {
      "id": "Llama-2-7b-chat-hf",
      "object": "model",
      "created": 1712919071,
      "owned_by": "fastchat",
      "root": "Llama-2-7b-chat-hf",
      "parent": null,
      "permission": [
        {
          "id": "modelperm-XpFyEE7Sewx4XYbEcdbCVz",
          "object": "model_permission",
          "created": 1712919071,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": true,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
```

#### Chat Completions

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }' | jq
```

Example output:

```json
{
  "id": "chatcmpl-jJ9vKSGkcDMTxKfLxK7q2x",
  "object": "chat.completion",
  "created": 1712919092,
  "model": "Llama-2-7b-chat-hf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " Hello! My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. Unterscheidung. 😊"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "total_tokens": 53,
    "completion_tokens": 38
  }
}
```

#### Text Completions

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-2-7b-chat-hf",
    "prompt": "Once upon a time",
    "max_tokens": 41,
    "temperature": 0.5
  }' | jq
```

Example output:

```json
{
  "id": "cmpl-PsAkpTWMmBLzWCTtM4r97Y",
  "object": "text_completion",
  "created": 1712919307,
  "model": "Llama-2-7b-chat-hf",
  "choices": [
    {
      "index": 0,
      "text": ", in a far-off land, there was a magical kingdom called \"Happily Ever Laughter.\" It was a place where laughter was the key to happiness, and everyone who ",
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 45,
    "completion_tokens": 40
  }
}
```
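
Since the endpoints follow the OpenAI API shape, you can also request a streaming response. The sketch below assumes your FastChat `openai_api_server` version honors the standard `stream` field; the reply then arrives as server-sent `data:` chunks rather than a single JSON object:

```bash
# Stream tokens as they are generated (assumes the server supports the OpenAI "stream" field)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Tell me a short story."}],
    "stream": true
  }'
```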

@@ -19,6 +19,7 @@ This section includes efficient guide to show you how to:
* `Run Coding Copilot (Continue) in VSCode with Intel GPU <./continue_quickstart.html>`_
* `Run llama.cpp with IPEX-LLM on Intel GPU <./llama_cpp_quickstart.html>`_
* `Run Ollama with IPEX-LLM on Intel GPU <./ollama_quickstart.html>`_
* `Run IPEX-LLM Serving with FastChat <./fastchat_quickstart.html>`_

.. |bigdl_llm_migration_guide| replace:: ``bigdl-llm`` Migration Guide
.. _bigdl_llm_migration_guide: bigdl_llm_migration.html