# Serving using IPEX-LLM and FastChat

FastChat is an open platform for training, serving, and evaluating large language model based chatbots. You can find detailed information on its [homepage](https://github.com/lm-sys/FastChat).

IPEX-LLM can be easily integrated into FastChat so that users can use `IPEX-LLM` as a serving backend in their deployment.

## Quick Start

This quickstart guide walks you through installing and running `FastChat` with `ipex-llm`.

## 1. Install IPEX-LLM with FastChat

To run on CPU, you can install `ipex-llm` as follows:

```bash
pip install --pre --upgrade ipex-llm[serving,all]
```

To add GPU support for FastChat, you may install **`ipex-llm`** as follows:

```bash
pip install --pre --upgrade ipex-llm[xpu,serving] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```

## 2. Start the service

### Launch controller

You first need to run the FastChat controller:

```bash
python3 -m fastchat.serve.controller
```

If the controller runs successfully, you will see output like this:

```bash
Uvicorn running on http://localhost:21001
```

### Launch model worker(s) and load models

Using IPEX-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformers models can be used in FastChat.

#### IPEX-LLM worker

To integrate IPEX-LLM with `FastChat` efficiently, we provide a model worker implementation named `ipex_llm_worker.py`.

```bash
# On CPU
# Available low-bit formats include sym_int4, sym_int8, bf16, etc.
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "cpu"

# On GPU
# Available low-bit formats include sym_int4, sym_int8, fp16, etc.
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu"
```

#### Self-speculative decoding example

You can use IPEX-LLM to run a `self-speculative decoding` example. Refer to the [Speculative-Decoding examples](https://github.com/intel-analytics/ipex-llm/tree/c9fac8c26bf1e1e8f7376fa9a62b32951dd9e85d/python/llm/example/GPU/Speculative-Decoding) for more details on Intel MAX GPUs and Intel CPUs.

```bash
# On CPU, the only available low-bit format is bf16.
source ipex-llm-init -t
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "bf16" --trust-remote-code --device "cpu" --speculative

# On GPU, the only available low-bit format is fp16.
source /opt/intel/oneapi/setvars.sh
export ENABLE_SDP_FUSION=1
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "fp16" --trust-remote-code --device "xpu" --speculative
```

You should see output like this:

```bash
2024-04-12 18:18:09 | INFO | ipex_llm.transformers.utils | Converting the current model to sym_int4 format......
2024-04-12 18:18:11 | INFO | model_worker | Register to controller
2024-04-12 18:18:11 | ERROR | stderr | INFO: Started server process [126133]
2024-04-12 18:18:11 | ERROR | stderr | INFO: Waiting for application startup.
2024-04-12 18:18:11 | ERROR | stderr | INFO: Application startup complete.
2024-04-12 18:18:11 | ERROR | stderr | INFO: Uvicorn running on http://localhost:21002
```

For a full list of accepted arguments, refer to the `main` method of `ipex_llm_worker.py`.
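Once the worker is up, you can optionally verify that it has registered with the controller. The snippet below is a minimal Python sketch that queries the controller's `/list_models` endpoint (exposed by recent FastChat releases) at its default address; adjust the host and port if you changed them.

```python
import requests

# Minimal sketch: ask the FastChat controller which models are currently registered.
# Assumes the controller is running at its default address (http://localhost:21001).
response = requests.post("http://localhost:21001/list_models", timeout=60)
response.raise_for_status()
print(response.json())  # e.g. {"models": ["vicuna-7b-v1.5"]}
```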
#### IPEX-LLM vLLM worker

We also provide a `vllm_worker`, which uses the [vLLM](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/vLLM-Serving) engine for better hardware utilization.

To run with the `vllm_worker`, you do not need to change the model name; simply use the following command:

```bash
# On CPU
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device cpu

# On GPU
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu
```

### Launch Gradio web server

When the controller and the worker are running, you can start the web server as follows:

```bash
python3 -m fastchat.serve.gradio_web_server
```

This is the user interface that users will interact with. By following these steps, you will be able to serve your models through the web UI with IPEX-LLM as the backend. You can now open your browser and chat with a model.

### Launch TGI Style API server

When the controller and the worker are running, you can start the TGI-style API server as follows:

```bash
python3 -m ipex_llm.serving.fastchat.tgi_api_server --host localhost --port 8000
```

You can use `curl` to observe the output of the API.

#### Using /generate API

This sends a sentence as input in the request and expects a single response containing the model-generated answer.

```bash
curl -X POST -H "Content-Type: application/json" -d '{
  "inputs": "What is AI?",
  "parameters": {
    "best_of": 1,
    "decoder_input_details": true,
    "details": true,
    "do_sample": true,
    "frequency_penalty": 0.1,
    "grammar": {
      "type": "json",
      "value": "string"
    },
    "max_new_tokens": 32,
    "repetition_penalty": 1.03,
    "return_full_text": false,
    "seed": 0.1,
    "stop": [
      "photographer"
    ],
    "temperature": 0.5,
    "top_k": 10,
    "top_n_tokens": 5,
    "top_p": 0.95,
    "truncate": true,
    "typical_p": 0.95,
    "watermark": true
  }
}' http://localhost:8000/generate
```

Sample output:

```bash
{
  "details": {
    "best_of_sequences": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer "
        },
        "finish_reason": "length",
        "generated_text": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer ",
        "generated_tokens": 31
      }
    ]
  },
  "generated_text": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer ",
  "usage": {
    "prompt_tokens": 4,
    "total_tokens": 35,
    "completion_tokens": 31
  }
}
```
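If you prefer calling the API from Python instead of `curl`, the same request can be sent with the `requests` library. The sketch below assumes the TGI-style server launched above is listening on `localhost:8000` and only passes a small subset of the parameters shown in the `curl` example.

```python
import requests

# Minimal sketch: call the TGI-style /generate endpoint started above.
payload = {
    "inputs": "What is AI?",
    "parameters": {
        "do_sample": True,
        "max_new_tokens": 32,
        "temperature": 0.5,
        "top_p": 0.95,
    },
}

response = requests.post("http://localhost:8000/generate", json=payload, timeout=300)
response.raise_for_status()

# The model-generated answer is returned in the "generated_text" field.
print(response.json()["generated_text"])
```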
#### Using /generate_stream API

This sends a sentence as input in the request and opens a long-lived connection that continuously streams back multiple responses containing the model-generated answer.

```bash
curl -X POST -H "Content-Type: application/json" -d '{
  "inputs": "What is AI?",
  "parameters": {
    "best_of": 1,
    "decoder_input_details": true,
    "details": true,
    "do_sample": true,
    "frequency_penalty": 0.1,
    "grammar": {
      "type": "json",
      "value": "string"
    },
    "max_new_tokens": 32,
    "repetition_penalty": 1.03,
    "return_full_text": false,
    "seed": 0.1,
    "stop": [
      "photographer"
    ],
    "temperature": 0.5,
    "top_k": 10,
    "top_n_tokens": 5,
    "top_p": 0.95,
    "truncate": true,
    "typical_p": 0.95,
    "watermark": true
  }
}' http://localhost:8000/generate_stream
```

Sample output:

```bash
data: {"token": {"id": 663359, "text": "", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 300560, "text": "\n", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 725120, "text": "Artificial Intelligence ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 734609, "text": "(AI) is ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 362235, "text": "a branch of computer ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 380983, "text": "science that attempts to ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 249979, "text": "simulate the way that ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 972663, "text": "the human brain ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 793301, "text": "works. It is a ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 501380, "text": "branch of computer ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 673232, "text": "", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 2, "text": "", "logprob": 0.0, "special": true}, "generated_text": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer ", "details": {"finish_reason": "eos_token", "generated_tokens": 31, "prefill_tokens": 4, "seed": 2023}, "special_ret": {"tensor": []}}
```
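As the sample output shows, the streaming endpoint emits one `data: {...}` line per generated token. The sketch below consumes the stream from Python with `requests`; it assumes the same local server and a reduced parameter set.

```python
import json
import requests

# Minimal sketch: stream tokens from the TGI-style /generate_stream endpoint.
payload = {
    "inputs": "What is AI?",
    "parameters": {"do_sample": True, "max_new_tokens": 32, "temperature": 0.5},
}

with requests.post(
    "http://localhost:8000/generate_stream", json=payload, stream=True, timeout=300
) as response:
    response.raise_for_status()
    for raw_line in response.iter_lines():
        if not raw_line:
            continue
        line = raw_line.decode("utf-8")
        # Each event is prefixed with "data: " followed by a JSON payload.
        if line.startswith("data: "):
            line = line[len("data: "):]
        event = json.loads(line)
        print(event["token"]["text"], end="", flush=True)
print()
```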
### Launch RESTful API server

To start an OpenAI-compatible API server with the IPEX-LLM backend, you can launch the `openai_api_server` and follow this [doc](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) to use it.

When the controller and the worker are running, you can start the RESTful API server as follows:

```bash
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```

You can use `curl` to observe the output of the API and format the output with `jq`.

#### List Models

```bash
curl http://localhost:8000/v1/models | jq
```

Example output:

```json
{
  "object": "list",
  "data": [
    {
      "id": "Llama-2-7b-chat-hf",
      "object": "model",
      "created": 1712919071,
      "owned_by": "fastchat",
      "root": "Llama-2-7b-chat-hf",
      "parent": null,
      "permission": [
        {
          "id": "modelperm-XpFyEE7Sewx4XYbEcdbCVz",
          "object": "model_permission",
          "created": 1712919071,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": true,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
```

#### Chat Completions

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }' | jq
```

Example output:

```json
{
  "id": "chatcmpl-jJ9vKSGkcDMTxKfLxK7q2x",
  "object": "chat.completion",
  "created": 1712919092,
  "model": "Llama-2-7b-chat-hf",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " Hello! My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. Unterscheidung. 😊"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "total_tokens": 53,
    "completion_tokens": 38
  }
}
```

#### Text Completions

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-2-7b-chat-hf",
    "prompt": "Once upon a time",
    "max_tokens": 41,
    "temperature": 0.5
  }' | jq
```

Example output:

```json
{
  "id": "cmpl-PsAkpTWMmBLzWCTtM4r97Y",
  "object": "text_completion",
  "created": 1712919307,
  "model": "Llama-2-7b-chat-hf",
  "choices": [
    {
      "index": 0,
      "text": ", in a far-off land, there was a magical kingdom called \"Happily Ever Laughter.\" It was a place where laughter was the key to happiness, and everyone who ",
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 45,
    "completion_tokens": 40
  }
}
```
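Because the server exposes OpenAI-compatible APIs, you can also call it from Python with the official `openai` client instead of `curl`. This is a minimal sketch assuming the `openai` package (v1 interface) is installed and the server above is listening on `localhost:8000`; the API key is a placeholder, since the local server does not validate it by default.

```python
from openai import OpenAI

# Minimal sketch: point the OpenAI client at the local FastChat OpenAI-compatible server.
# "EMPTY" is a placeholder key; the server does not check it unless configured to.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
)
print(completion.choices[0].message.content)
```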