From 1291165720763b62edbf1de4fda1e257a09d842b Mon Sep 17 00:00:00 2001
From: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
Date: Fri, 24 May 2024 10:21:21 +0800
Subject: [PATCH] LLM: Add quickstart for vLLM cpu (#11122)
Add quickstart for vLLM cpu.
---
 .../source/_templates/sidebar_quicklinks.html |   3 +
 docs/readthedocs/source/_toc.yml              |   1 +
 .../source/doc/LLM/DockerGuides/index.rst     |   1 +
 .../vllm_cpu_docker_quickstart.md             | 118 ++++++++++++++++++
 .../DockerGuides/vllm_docker_quickstart.md    |   1 +
 5 files changed, 124 insertions(+)
 create mode 100644 docs/readthedocs/source/doc/LLM/DockerGuides/vllm_cpu_docker_quickstart.md
diff --git a/docs/readthedocs/source/_templates/sidebar_quicklinks.html b/docs/readthedocs/source/_templates/sidebar_quicklinks.html
index d1a58980..d1aed482 100644
--- a/docs/readthedocs/source/_templates/sidebar_quicklinks.html
+++ b/docs/readthedocs/source/_templates/sidebar_quicklinks.html
@@ -95,6 +95,9 @@
                     
                         vLLM with `ipex-llm` on Intel GPU
                     
+                    
+                        vLLM with `ipex-llm` on Intel CPU
+                    
                 
             
             
diff --git a/docs/readthedocs/source/_toc.yml b/docs/readthedocs/source/_toc.yml
index 9f4b3578..0f5383a8 100644
--- a/docs/readthedocs/source/_toc.yml
+++ b/docs/readthedocs/source/_toc.yml
@@ -25,6 +25,7 @@ subtrees:
                 - file: doc/LLM/DockerGuides/docker_cpp_xpu_quickstart
                 - file: doc/LLM/DockerGuides/fastchat_docker_quickstart
                 - file: doc/LLM/DockerGuides/vllm_docker_quickstart
+                - file: doc/LLM/DockerGuides/vllm_cpu_docker_quickstart
           - file: doc/LLM/Quickstart/index
             title: "Quickstart"
             subtrees:
diff --git a/docs/readthedocs/source/doc/LLM/DockerGuides/index.rst b/docs/readthedocs/source/doc/LLM/DockerGuides/index.rst
index 0e6cb976..29781e52 100644
--- a/docs/readthedocs/source/doc/LLM/DockerGuides/index.rst
+++ b/docs/readthedocs/source/doc/LLM/DockerGuides/index.rst
@@ -12,3 +12,4 @@ In this section, you will find guides related to using IPEX-LLM with Docker, cov
 * Serving
    * `FastChat with IPEX-LLM on Intel GPU <./fastchat_docker_quickstart.html>`_
    * `vLLM with IPEX-LLM on Intel GPU <./vllm_docker_quickstart.html>`_
+   * `vLLM with IPEX-LLM on Intel CPU <./vllm_cpu_docker_quickstart.html>`_
diff --git a/docs/readthedocs/source/doc/LLM/DockerGuides/vllm_cpu_docker_quickstart.md b/docs/readthedocs/source/doc/LLM/DockerGuides/vllm_cpu_docker_quickstart.md
new file mode 100644
index 00000000..16d96367
--- /dev/null
+++ b/docs/readthedocs/source/doc/LLM/DockerGuides/vllm_cpu_docker_quickstart.md
@@ -0,0 +1,118 @@
+# Serving using IPEX-LLM integrated vLLM on Intel CPU via Docker
+
+This guide demonstrates how to run LLM serving with `IPEX-LLM` integrated `vLLM` in Docker on Linux with an Intel CPU.
+
+## Install docker
+
+Follow the instructions in this [guide](https://www.docker.com/get-started/) to install Docker on Linux.
+
+## Pull the latest image
+
+*Note: For running vLLM serving on Intel CPUs, you can currently use either the `intelanalytics/ipex-llm-serving-cpu:latest` or `intelanalytics/ipex-llm-serving-vllm-cpu:latest` Docker image.*
+
+```bash
+# This image will be updated every day
+docker pull intelanalytics/ipex-llm-serving-cpu:latest
+```
+
+## Start Docker Container
+
+To make full use of your Intel CPU for vLLM inference and serving, start the container as follows:
+
+```bash
+#!/bin/bash
+export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:latest
+export CONTAINER_NAME=ipex-llm-serving-cpu-container
+sudo docker run -itd \
+        --net=host \
+        --cpuset-cpus="0-47" \
+        --cpuset-mems="0" \
+        -v /path/to/models:/llm/models \
+        -e no_proxy=localhost,127.0.0.1 \
+        --memory="64G" \
+        --name=$CONTAINER_NAME \
+        --shm-size="16g" \
+        $DOCKER_IMAGE
+```
+
+After the container has booted, you can enter it with `docker exec`:
+
+```bash
+docker exec -it ipex-llm-serving-cpu-container /bin/bash
+```
+
+## Running vLLM serving with IPEX-LLM on Intel CPU in Docker
+
+We have included multiple vLLM-related files in `/llm/`:
+1. `vllm_offline_inference.py`: an example of vLLM offline inference
+2. `benchmark_vllm_throughput.py`: a script for benchmarking throughput
+3. `payload-1024.lua`: a `wrk` payload for testing requests per second with 1k-input/128-output requests
+4. `start-vllm-service.sh`: a template script for starting the vLLM service
+
+Before running benchmarks or starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_cpu.html#environment-setup) to set up our recommended runtime configurations.
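+
+For illustration, a minimal sketch of typical CPU runtime tuning inside the container (the linked section is authoritative; the thread count and core range below are assumptions chosen to match the `--cpuset-cpus="0-47"` setting used when starting the container, and `numactl` is assumed to be available):
+
+```bash
+# One OpenMP thread per core bound to the container (assumed value)
+export OMP_NUM_THREADS=48
+# Pin execution and memory to the bound cores / NUMA node 0 when launching the service
+numactl -C 0-47 -m 0 bash /llm/start-vllm-service.sh
+```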
+
+### Service
+
+A script named `/llm/start-vllm-service.sh` is included in the image for starting the service conveniently.
+
+Modify `model` and `served_model_name` in the script to fit your requirements, as in the sketch below; `served_model_name` is the model name exposed through the API.
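+
+For illustration, a minimal sketch of the two values to edit (the exact variable layout inside the script may differ between image versions; the model path below assumes the `-v /path/to/models:/llm/models` mount used when starting the container):
+
+```bash
+model="/llm/models/Qwen1.5-7B-Chat"   # path to the model inside the container (assumed)
+served_model_name="Qwen1.5"           # model name clients will use in API requests
+```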
+
+Then start the service using `bash /llm/start-vllm-service.sh`.
+
+If the service has booted successfully, you should see output similar to the following figure:
+
+*(figure: startup log of the vLLM API server)*
+
+#### Verify
+After the service has booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL` should be set to the `served_model_name` in your startup script, e.g. `Qwen1.5`.
+
+```bash
+curl http://localhost:8000/v1/completions \
+-H "Content-Type: application/json" \
+-d '{
+  "model": "YOUR_MODEL",
+  "prompt": "San Francisco is a",
+  "max_tokens": 128,
+  "temperature": 0
+}' | jq '.choices[0].text'
+```
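+
+If the served model is a chat model (e.g. `Qwen1.5-7B-Chat`), you can also exercise the OpenAI-compatible chat endpoint. A minimal sketch, assuming the service exposes `/v1/chat/completions` as upstream vLLM does:
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+-H "Content-Type: application/json" \
+-d '{
+  "model": "YOUR_MODEL",
+  "messages": [{"role": "user", "content": "List three landmarks in San Francisco."}],
+  "max_tokens": 128,
+  "temperature": 0
+}' | jq '.choices[0].message.content'
+```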
+
+Below is an example output using `Qwen1.5-7B-Chat` with the low-bit format `sym_int4`:
+
+*(figure: example completion output for Qwen1.5-7B-Chat with sym_int4)*
+
+#### Tuning
+
+You can tune the service using the following arguments:
+- `--max-model-len`
+- `--max-num-batched-tokens`
+- `--max-num-seqs`
+
+You can refer to this [doc](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#service) for a detailed explanation of these parameters.
+
+### Benchmark
+
+#### Online benchmark through api_server
+
+We can benchmark the api_server to get an estimate of TPS (transactions per second). To do so, you need to start the service first according to the instructions above.
+
+Then, inside the container, do the following:
+1. Modify `/llm/payload-1024.lua` so that the "model" attribute matches your `served_model_name` (see the one-line example after the `wrk` commands below). By default, the payload uses a prompt that is roughly 1024 tokens long; change it if needed.
+2. Start the benchmark with `wrk` using the commands below:
+
+```bash
+cd /llm
+# warmup
+wrk -t4 -c4 -d3m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
+# You can change -t and -c to control the concurrency.
+# By default, we use 8 connections to benchmark the service.
+wrk -t8 -c8 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
+```
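+
+For step 1, a hedged one-liner to point the payload at your `served_model_name` (the model name and the payload's exact JSON layout are assumptions; you can equally edit the file by hand):
+
+```bash
+# Replace the existing "model" value in the wrk payload (assumed formatting)
+sed -i 's/"model": *"[^"]*"/"model": "Qwen1.5"/' /llm/payload-1024.lua
+```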
+
+#### Offline benchmark through benchmark_vllm_throughput.py
+
+Please refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking.
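+
+As a rough illustration, a hedged invocation sketch based on the standard vLLM `benchmark_throughput.py` arguments; the model path and the `--load-in-low-bit` option are assumptions, so treat the linked section as authoritative:
+
+```bash
+cd /llm
+# Synthetic workload: ~1024 input tokens and 128 output tokens per request (assumed)
+python3 benchmark_vllm_throughput.py \
+    --backend vllm \
+    --model /llm/models/Qwen1.5-7B-Chat \
+    --input-len 1024 \
+    --output-len 128 \
+    --num-prompts 1000 \
+    --seed 42 \
+    --trust-remote-code \
+    --load-in-low-bit sym_int4   # assumed IPEX-LLM-specific option for low-bit weights
+```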
diff --git a/docs/readthedocs/source/doc/LLM/DockerGuides/vllm_docker_quickstart.md b/docs/readthedocs/source/doc/LLM/DockerGuides/vllm_docker_quickstart.md
index e6387919..80f9ba65 100644
--- a/docs/readthedocs/source/doc/LLM/DockerGuides/vllm_docker_quickstart.md
+++ b/docs/readthedocs/source/doc/LLM/DockerGuides/vllm_docker_quickstart.md
@@ -8,6 +8,7 @@ Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/lates
 
 ## Pull the latest image
 
+*Note: For running vLLM serving on Intel GPUs, you can currently use either the `intelanalytics/ipex-llm-serving-xpu:latest` or `intelanalytics/ipex-llm-serving-vllm-xpu:latest` Docker image.*
+
 ```bash
 # This image will be updated every day
 docker pull intelanalytics/ipex-llm-serving-xpu:latest