Update mddocs for DockerGuides (#11380)

* transfer files in DockerGuides from rst to md

* add some dividing lines

* adjust the title hierarchy in docker_cpp_xpu_quickstart.md

* restore

* switch to the correct branch

* small change

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
Xu, Shuo 2024-06-21 12:10:35 +08:00 committed by GitHub
parent 1a1a97c9e4
commit fed79f106b
9 changed files with 134 additions and 148 deletions


@ -1,4 +1,4 @@
## Run llama.cpp/Ollama/Open-WebUI on an Intel GPU via Docker
# Run llama.cpp/Ollama/Open-WebUI on an Intel GPU via Docker
## Quick Start
@ -6,11 +6,11 @@
1. Linux Installation
Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux.
Follow the instructions in this [guide](./docker_windows_gpu.md#linux) to install Docker on Linux.
2. Windows Installation
For Windows installation, refer to this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#install-docker-desktop-for-windows).
For Windows installation, refer to this [guide](./docker_windows_gpu.md#install-docker-desktop-for-windows).
#### Setting Docker on Windows
@ -24,18 +24,18 @@ docker pull intelanalytics/ipex-llm-inference-cpp-xpu:latest
### Start Docker Container
```eval_rst
.. tabs::
.. tab:: Linux
Choose one of the following methods to start the container:
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Select the device you are running(device type:(Max, Flex, Arc, iGPU)). And change the `/path/to/models` to mount the models. `bench_model` is used to benchmark quickly. If want to benchmark, make sure it on the `/path/to/models`
<details>
<summary>For <strong>Linux</strong>:</summary>
.. code-block:: bash
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Select the device you are running on (device type: Max, Flex, Arc, or iGPU), and change `/path/to/models` to mount the models. `bench_model` is used for quick benchmarking; if you want to benchmark, make sure the model is placed under `/path/to/models`.
#/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container
sudo docker run -itd \
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
-v /path/to/models:/models \
@ -46,17 +46,19 @@ docker pull intelanalytics/ipex-llm-inference-cpp-xpu:latest
-e DEVICE=Arc \
--shm-size="16g" \
$DOCKER_IMAGE
.. tab:: Windows
```
</details>
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. And change the `/path/to/models` to mount the models. Then add `--privileged` and map the `/usr/lib/wsl` to the docker.
<details>
<summary>For <strong>Windows</strong>:</summary>
.. code-block:: bash
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container, and change `/path/to/models` to mount the models. Then add `--privileged` and map `/usr/lib/wsl` into the container.
#/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container
sudo docker run -itd \
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
--privileged \
@ -69,9 +71,10 @@ docker pull intelanalytics/ipex-llm-inference-cpp-xpu:latest
-e DEVICE=Arc \
--shm-size="16g" \
$DOCKER_IMAGE
```
</details>
```
---
After the container is booted, you can get into the container through `docker exec`.
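For example, using the `CONTAINER_NAME` set above, opening a shell in the running container might look like this:

```bash
# open an interactive shell inside the running container
sudo docker exec -it $CONTAINER_NAME /bin/bash
```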
@ -126,7 +129,7 @@ llama_print_timings: eval time = xxx ms / 31 runs ( xxx ms per
llama_print_timings: total time = xxx ms / xxx tokens
```
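For reference, a command that produces timing output like the above might look roughly as follows (a sketch only; the working directory, binary name, model file, and flag values are assumptions based on typical llama.cpp usage):

```bash
# hypothetical llama.cpp run on the Intel GPU; adjust paths, model file, and flags to your setup
cd /llm/llama-cpp
./main -m /models/mistral-7b-v0.1.Q4_K_M.gguf \
  -n 32 --prompt "Once upon a time, there existed a little girl" \
  -t 8 -e -ngl 33 --color
```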
Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html) for more details.
Please refer to this [documentation](../Quickstart/llama_cpp_quickstart.md) for more details.
### Running Ollama serving with IPEX-LLM on Intel GPU
@ -194,13 +197,13 @@ Sample output:
```
Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#pull-model) for more details.
Please refer to this [documentation](../Quickstart/ollama_quickstart.md#4-pull-model) for more details.
### Running Open WebUI with Intel GPU
Start Ollama and load the model first, then use Open WebUI to chat.
If you have difficulty accessing the huggingface repositories, you may use a mirror, e.g. add export HF_ENDPOINT=https://hf-mirror.com before running bash start.sh.
If you have difficulty accessing the Hugging Face repositories, you may use a mirror, e.g. add `export HF_ENDPOINT=https://hf-mirror.com` before running `bash start.sh`.
```bash
cd /llm/scripts/
bash start-open-webui.sh
@ -218,4 +221,4 @@ INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_signup.png" width="100%" />
</a>
For how to log-in or other guide, Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/open_webui_with_ollama_quickstart.html) for more details.
For how to log in and other guidance, please refer to this [documentation](../Quickstart/open_webui_with_ollama_quickstart.md).


@ -2,16 +2,12 @@
We can run PyTorch Inference Benchmark, Chat Service and PyTorch Examples on Intel GPUs within Docker (on Linux or WSL).
```eval_rst
.. note::
The current Windows + WSL + Docker solution only supports Arc series dGPU. For Windows users with MTL iGPU, it is recommended to install directly via pip install in Miniforge Prompt. Refer to `this guide <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_windows_gpu.html>`_.
```
> [!NOTE]
> The current Windows + WSL + Docker solution only supports Arc series dGPU. For Windows users with MTL iGPU, it is recommended to install directly via pip install in Miniforge Prompt. Refer to [this guide](../Quickstart/install_windows_gpu.md).
## Install Docker
Follow the [Docker installation Guide](./docker_windows_gpu.html#install-docker) to install docker on either Linux or Windows.
Follow the [Docker installation Guide](./docker_windows_gpu.md#install-docker) to install docker on either Linux or Windows.
## Launch Docker
@ -20,19 +16,17 @@ Prepare ipex-llm-xpu Docker Image:
docker pull intelanalytics/ipex-llm-xpu:latest
```
Start ipex-llm-xpu Docker Container:
Start ipex-llm-xpu Docker Container. Choose one of the following commands to start the container:
```eval_rst
.. tabs::
.. tab:: Linux
<details>
<summary>For <strong>Linux</strong>:</summary>
.. code-block:: bash
```bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
docker run -itd \
docker run -itd \
--net=host \
--device=/dev/dri \
--memory="32G" \
@ -40,17 +34,19 @@ Start ipex-llm-xpu Docker Container:
--shm-size="16g" \
-v $MODEL_PATH:/llm/models \
$DOCKER_IMAGE
```
</details>
.. tab:: Windows WSL
<details>
<summary>For <strong>Windows WSL</strong>:</summary>
.. code-block:: bash
```bash
#/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
sudo docker run -itd \
sudo docker run -itd \
--net=host \
--privileged \
--device /dev/dri \
@ -60,8 +56,10 @@ Start ipex-llm-xpu Docker Container:
-v $MODEL_PATH:/llm/llm-models \
-v /usr/lib/wsl:/usr/lib/wsl \
$DOCKER_IMAGE
```
```
</details>
---
Access the container:
```
@ -77,18 +75,13 @@ root@arda-arc12:/# sycl-ls
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
```eval_rst
.. tip::
You can run the Env-Check script to verify your ipex-llm installation and runtime environment.
.. code-block:: bash
cd /ipex-llm/python/llm/scripts
bash env-check.sh
```
> [!TIP]
> You can run the Env-Check script to verify your ipex-llm installation and runtime environment.
>
> ```bash
> cd /ipex-llm/python/llm/scripts
> bash env-check.sh
> ```
## Run Inference Benchmark


@ -4,21 +4,18 @@ An IPEX-LLM container is a pre-configured environment that includes all necessar
This guide provides steps to run/develop PyTorch examples in VSCode with Docker on Intel GPUs.
```eval_rst
.. note::
This guide assumes you have already installed VSCode in your environment.
To run/develop on Windows, install VSCode and then follow the steps below.
To run/develop on Linux, you might open VSCode first and SSH to a remote Linux machine, then proceed with the following steps.
```
> [!NOTE]
> This guide assumes you have already installed VSCode in your environment.
>
> To run/develop on Windows, install VSCode and then follow the steps below.
>
> To run/develop on Linux, you might open VSCode first and SSH to a remote Linux machine, then proceed with the following steps.
## Install Docker
Follow the [Docker installation Guide](./docker_windows_gpu.html#install-docker) to install docker on either Linux or Windows.
Follow the [Docker installation Guide](./docker_windows_gpu.md#install-docker) to install docker on either Linux or Windows.
## Install Extensions for VSCode
@ -52,19 +49,18 @@ Open the Terminal in VSCode (you can use the shortcut `` Ctrl+Shift+` ``), then
docker pull intelanalytics/ipex-llm-xpu:latest
```
Start ipex-llm-xpu Docker Container:
Start ipex-llm-xpu Docker Container. Choose one of the following commands to start the container:
```eval_rst
.. tabs::
.. tab:: Linux
<details>
<summary>For <strong>Linux</strong>:</summary>
.. code-block:: bash
```bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
docker run -itd \
docker run -itd \
--net=host \
--device=/dev/dri \
--memory="32G" \
@ -72,17 +68,19 @@ Start ipex-llm-xpu Docker Container:
--shm-size="16g" \
-v $MODEL_PATH:/llm/models \
$DOCKER_IMAGE
```
</details>
.. tab:: Windows WSL
<details>
<summary>For <strong>Windows WSL</strong>:</summary>
.. code-block:: bash
```bash
#/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
sudo docker run -itd \
sudo docker run -itd \
--net=host \
--privileged \
--device /dev/dri \
@ -92,8 +90,10 @@ Start ipex-llm-xpu Docker Container:
-v $MODEL_PATH:/llm/llm-models \
-v /usr/lib/wsl:/usr/lib/wsl \
$DOCKER_IMAGE
```
```
</details>
---
## Run/Develop Pytorch Examples


@ -14,18 +14,12 @@ Follow the instructions in the [Offcial Docker Guide](https://www.docker.com/get
### Windows
```eval_rst
.. tip::
> [!TIP]
> The installation requires at least 35GB of free disk space on C drive.
The installation requires at least 35GB of free disk space on C drive.
> [!NOTE]
> Detailed installation instructions for Windows, including steps for enabling WSL2, can be found on the [Docker Desktop for Windows installation page](https://docs.docker.com/desktop/install/windows-install/).
```
```eval_rst
.. note::
Detailed installation instructions for Windows, including steps for enabling WSL2, can be found on the [Docker Desktop for Windows installation page](https://docs.docker.com/desktop/install/windows-install/).
```
#### Install Docker Desktop for Windows
Follow the instructions in [this guide](https://docs.docker.com/desktop/install/windows-install/) to install **Docker Desktop for Windows**. Restart your machine after the installation is complete.
@ -34,11 +28,9 @@ Follow the instructions in [this guide](https://docs.docker.com/desktop/install/
Follow the instructions in [this guide](https://docs.microsoft.com/en-us/windows/wsl/install) to install **Windows Subsystem for Linux 2 (WSL2)**.
```eval_rst
.. tip::
> [!TIP]
> You may verify WSL2 installation by running the command `wsl --list` in PowerShell or Command Prompt. If WSL2 is installed, you will see a list of installed Linux distributions.
You may verify WSL2 installation by running the command `wsl --list` in PowerShell or Command Prompt. If WSL2 is installed, you will see a list of installed Linux distributions.
```
#### Enable Docker integration with WSL2
@ -47,11 +39,10 @@ Open **Docker desktop**, and select `Settings`->`Resources`->`WSL integration`->
<img src="https://llm-assets.readthedocs.io/en/latest/_images/docker_desktop_new.png" width=100%; />
</a>
```eval_rst
.. tip::
If you encounter **Docker Engine stopped** when opening Docker Desktop, you can reopen it in administrator mode.
```
> [!TIP]
> If you encounter **Docker Engine stopped** when opening Docker Desktop, you can reopen it in administrator mode.
#### Verify Docker is enabled in WSL2
@ -67,11 +58,9 @@ You can see the output similar to the following:
<img src="https://llm-assets.readthedocs.io/en/latest/_images/docker_wsl.png" width=100%; />
</a>
```eval_rst
.. tip::
During the use of Docker in WSL, Docker Desktop needs to be kept open all the time.
```
> [!TIP]
> While using Docker in WSL, Docker Desktop needs to stay open the whole time.
## IPEX-LLM Docker Containers
@ -89,7 +78,7 @@ We have several docker images available for running LLMs on Intel GPUs. The foll
| intelanalytics/ipex-llm-finetune-qlora-xpu:latest| GPU Finetuning|For fine-tuning LLMs using QLora/Lora, etc.|
We have also provided several quickstarts for various usage scenarios:
- [Run and develop LLM applications in PyTorch](./docker_pytorch_inference_gpu.html)
- [Run and develop LLM applications in PyTorch](./docker_pytorch_inference_gpu.md)
... to be added soon.


@ -4,7 +4,7 @@ This guide demonstrates how to run `FastChat` serving with `IPEX-LLM` on Intel G
## Install docker
Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux.
Follow the instructions in this [guide](./docker_windows_gpu.md#linux) to install Docker on Linux.
## Pull the latest image
@ -17,7 +17,7 @@ docker pull intelanalytics/ipex-llm-serving-xpu:latest
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Change the `/path/to/models` to mount the models.
```
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=ipex-llm-serving-xpu-container
@ -54,9 +54,9 @@ root@arda-arc12:/# sycl-ls
For convenience, we have provided a script named `/llm/start-fastchat-service.sh` for you to start the service.
However, the script only provide instructions for the most common scenarios. If this script doesn't meet your needs, you can always find the complete guidance for FastChat at [Serving using IPEX-LLM and FastChat](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#start-the-service).
However, the script only covers the most common scenarios. If this script doesn't meet your needs, you can always find the complete guidance for FastChat at [Serving using IPEX-LLM and FastChat](../Quickstart/fastchat_quickstart.md#2-start-the-service).
Before starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#runtime-configurations) to setup our recommended runtime configurations.
Before starting the service, you can refer to this [section](../Quickstart/install_linux_gpu.md#runtime-configurations) to set up our recommended runtime configurations.
Now we can start the FastChat service. You can use the provided script `/llm/start-fastchat-service.sh` as follows:
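A minimal invocation might look like this (a sketch only; the script may expect additional arguments, e.g. which model to serve, so check the script itself or the linked guide):

```bash
# run the provided helper script; pass any options it expects for your model
bash /llm/start-fastchat-service.sh
```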
@ -105,10 +105,10 @@ The `vllm_worker` may start slowly than normal `ipex_llm_worker`. The booted se
</a>
```eval_rst
.. note::
To verify/use the service booted by the script, follow the instructions in `this guide <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#launch-restful-api-serve>`_.
```
> [!NOTE]
> To verify/use the service booted by the script, follow the instructions in [this guide](../Quickstart/fastchat_quickstart.md#launch-restful-api-server).
After a request has been sent to the `openai_api_server`, the corresponding inference latency can be found in the worker log as shown below:


@ -0,0 +1,16 @@
# IPEX-LLM Docker Container User Guides
In this section, you will find guides related to using IPEX-LLM with Docker, covering how to:
- [Overview of IPEX-LLM Containers](./docker_windows_gpu.md)
- Inference in Python/C++
- [GPU Inference in Python with IPEX-LLM](./docker_pytorch_inference_gpu.md)
- [VSCode LLM Development with IPEX-LLM on Intel GPU](./docker_run_pytorch_inference_in_vscode.md)
- [llama.cpp/Ollama/Open-WebUI with IPEX-LLM on Intel GPU](./docker_cpp_xpu_quickstart.md)
- Serving
- [FastChat with IPEX-LLM on Intel GPU](./fastchat_docker_quickstart.md)
- [vLLM with IPEX-LLM on Intel GPU](./vllm_docker_quickstart.md)
- [vLLM with IPEX-LLM on Intel CPU](./vllm_cpu_docker_quickstart.md)


@ -1,15 +0,0 @@
IPEX-LLM Docker Container User Guides
=====================================
In this section, you will find guides related to using IPEX-LLM with Docker, covering how to:
* `Overview of IPEX-LLM Containers <./docker_windows_gpu.html>`_
* Inference in Python/C++
* `GPU Inference in Python with IPEX-LLM <./docker_pytorch_inference_gpu.html>`_
* `VSCode LLM Development with IPEX-LLM on Intel GPU <./docker_pytorch_inference_gpu.html>`_
* `llama.cpp/Ollama/Open-WebUI with IPEX-LLM on Intel GPU <./docker_cpp_xpu_quickstart.html>`_
* Serving
* `FastChat with IPEX-LLM on Intel GPU <./fastchat_docker_quickstart.html>`_
* `vLLM with IPEX-LLM on Intel GPU <./vllm_docker_quickstart.html>`_
* `vLLM with IPEX-LLM on Intel CPU <./vllm_cpu_docker_quickstart.html>`_


@ -18,7 +18,7 @@ docker pull intelanalytics/ipex-llm-serving-cpu:latest
## Start Docker Container
To fully use your Intel CPU to run vLLM inference and serving, you should start the Docker container with the settings shown below:
```
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:latest
export CONTAINER_NAME=ipex-llm-serving-cpu-container
@ -48,7 +48,7 @@ We have included multiple vLLM-related files in `/llm/`:
3. `payload-1024.lua`: Used for testing requests per second with 1k-128 requests
4. `start-vllm-service.sh`: A template script for starting the vLLM service
Before performing benchmark or starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_cpu.html#environment-setup) to setup our recommended runtime configurations.
Before performing benchmarks or starting the service, you can refer to this [section](../Overview/install_cpu.md#environment-setup) to set up our recommended runtime configurations.
### Service
@ -92,7 +92,7 @@ You can tune the service using these four arguments:
- `--max-num-batched-token`
- `--max-num-seq`
You can refer to this [doc](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#service) for a detailed explaination on these parameters.
You can refer to this [doc](../Quickstart/vLLM_quickstart.md#service) for a detailed explanation of these parameters.
### Benchmark
@ -115,4 +115,4 @@ wrk -t8 -c8 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --tim
#### Offline benchmark through benchmark_vllm_throughput.py
Please refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking.
Please refer to this [section](../Quickstart/vLLM_quickstart.md#5performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking.


@ -4,7 +4,7 @@ This guide demonstrates how to run `vLLM` serving with `IPEX-LLM` on Intel GPUs
## Install docker
Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux.
Follow the instructions in this [guide](./docker_windows_gpu.md#linux) to install Docker on Linux.
## Pull the latest image
@ -18,7 +18,7 @@ docker pull intelanalytics/ipex-llm-serving-xpu:latest
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Change the `/path/to/models` to mount the models.
```
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=ipex-llm-serving-xpu-container
@ -58,7 +58,7 @@ We have included multiple vLLM-related files in `/llm/`:
3. `payload-1024.lua`: Used for testing requests per second with 1k-128 requests
4. `start-vllm-service.sh`: A template script for starting the vLLM service
Before performing benchmark or starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#runtime-configurations) to setup our recommended runtime configurations.
Before performing benchmarks or starting the service, you can refer to this [section](../Quickstart/install_linux_gpu.md#runtime-configurations) to set up our recommended runtime configurations.
### Service
@ -82,7 +82,7 @@ If the service have booted successfully, you should see the output similar to th
vLLM supports utilizing multiple cards through tensor parallelism.
You can refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#about-tensor-parallel) on how to utilize the `tensor-parallel` feature and start the service.
You can refer to this [documentation](../Quickstart/vLLM_quickstart.md#4-about-tensor-parallel) on how to utilize the `tensor-parallel` feature and start the service.
#### Verify
After the service has been booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL` should be set equal to `served_model_name` in your booting script, e.g. `Qwen1.5`.
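For illustration, a minimal test request might look like the following (a sketch assuming the service exposes the OpenAI-compatible API on the default port 8000; adjust the model name and prompt as appropriate):

```bash
# send a simple completion request to the OpenAI-compatible endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "YOUR_MODEL",
        "prompt": "San Francisco is a",
        "max_tokens": 64,
        "temperature": 0
      }'
```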
@ -113,7 +113,7 @@ You can tune the service using these four arguments:
- `--max-num-batched-token`
- `--max-num-seq`
You can refer to this [doc](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#service) for a detailed explaination on these parameters.
You can refer to this [doc](../Quickstart/vLLM_quickstart.md#service) for a detailed explanation of these parameters.
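For illustration, a sketch of how such flags might appear on the server launch command (assuming the standard vLLM OpenAI-compatible entrypoint and its flag spellings, e.g. `--max-num-batched-tokens` / `--max-num-seqs`; the values and model path below are placeholders, not recommendations):

```bash
# illustrative launch command; tune the values for your hardware and model
python -m vllm.entrypoints.openai.api_server \
  --model /llm/models/Qwen1.5-7B-Chat \
  --served-model-name Qwen1.5 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 2048 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 256
```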
### Benchmark
@ -143,4 +143,4 @@ The following figure shows performing benchmark on `Llama-2-7b-chat-hf` using th
#### Offline benchmark through benchmark_vllm_throughput.py
Please refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking.
Please refer to this [section](../Quickstart/vLLM_quickstart.md#5performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking.