From fed79f106bc05492d19bb56e7ea6ff5d06eb56f7 Mon Sep 17 00:00:00 2001 From: "Xu, Shuo" <100334393+ATMxsp01@users.noreply.github.com> Date: Fri, 21 Jun 2024 12:10:35 +0800 Subject: [PATCH] Update mddocs for DockerGuides (#11380) * transfer files in DockerGuides from rst to md * add some dividing lines * adjust the title hierarchy in docker_cpp_xpu_quickstart.md * restore * switch to the correct branch * small change --------- Co-authored-by: ATMxsp01 --- .../DockerGuides/docker_cpp_xpu_quickstart.md | 55 ++++++++------- .../docker_pytorch_inference_gpu.md | 69 +++++++++---------- .../docker_run_pytorch_inference_in_vscode.md | 56 +++++++-------- .../mddocs/DockerGuides/docker_windows_gpu.md | 35 ++++------ .../fastchat_docker_quickstart.md | 16 ++--- docs/mddocs/DockerGuides/index.md | 16 +++++ docs/mddocs/DockerGuides/index.rst | 15 ---- .../vllm_cpu_docker_quickstart.md | 8 +-- .../DockerGuides/vllm_docker_quickstart.md | 12 ++-- 9 files changed, 134 insertions(+), 148 deletions(-) create mode 100644 docs/mddocs/DockerGuides/index.md delete mode 100644 docs/mddocs/DockerGuides/index.rst diff --git a/docs/mddocs/DockerGuides/docker_cpp_xpu_quickstart.md b/docs/mddocs/DockerGuides/docker_cpp_xpu_quickstart.md index 92156d25..85a6b9ad 100644 --- a/docs/mddocs/DockerGuides/docker_cpp_xpu_quickstart.md +++ b/docs/mddocs/DockerGuides/docker_cpp_xpu_quickstart.md @@ -1,4 +1,4 @@ -## Run llama.cpp/Ollama/Open-WebUI on an Intel GPU via Docker +# Run llama.cpp/Ollama/Open-WebUI on an Intel GPU via Docker ## Quick Start @@ -6,11 +6,11 @@ 1. Linux Installation - Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux. + Follow the instructions in this [guide](./docker_windows_gpu.md#linux) to install Docker on Linux. 2. Windows Installation - For Windows installation, refer to this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#install-docker-desktop-for-windows). + For Windows installation, refer to this [guide](./docker_windows_gpu.md#install-docker-desktop-for-windows). #### Setting Docker on windows @@ -24,18 +24,18 @@ docker pull intelanalytics/ipex-llm-inference-cpp-xpu:latest ### Start Docker Container -```eval_rst -.. tabs:: - .. tab:: Linux +Choose one of the following methods to start the container: - To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Select the device you are running(device type:(Max, Flex, Arc, iGPU)). And change the `/path/to/models` to mount the models. `bench_model` is used to benchmark quickly. If want to benchmark, make sure it on the `/path/to/models` +
+For Linux: - .. code-block:: bash + To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Select the device you are running(device type:(Max, Flex, Arc, iGPU)). And change the `/path/to/models` to mount the models. `bench_model` is used to benchmark quickly. If want to benchmark, make sure it on the `/path/to/models` - #/bin/bash - export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest - export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container - sudo docker run -itd \ + ```bash + #/bin/bash + export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest + export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container + sudo docker run -itd \ --net=host \ --device=/dev/dri \ -v /path/to/models:/models \ @@ -46,17 +46,19 @@ docker pull intelanalytics/ipex-llm-inference-cpp-xpu:latest -e DEVICE=Arc \ --shm-size="16g" \ $DOCKER_IMAGE - - .. tab:: Windows + ``` +
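Before running the Linux command above, it can help to confirm that the render nodes which `--device=/dev/dri` exposes are actually present on the host. This is only a quick sanity check; the exact `card*`/`renderD*` entries vary from system to system:

```bash
# List the DRI device nodes that will be mapped into the container
ls -l /dev/dri
```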
- To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. And change the `/path/to/models` to mount the models. Then add `--privileged` and map the `/usr/lib/wsl` to the docker. +
+For Windows: - .. code-block:: bash + To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. And change the `/path/to/models` to mount the models. Then add `--privileged` and map the `/usr/lib/wsl` to the docker. - #/bin/bash - export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest - export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container - sudo docker run -itd \ + ```bash + #/bin/bash + export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest + export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container + sudo docker run -itd \ --net=host \ --device=/dev/dri \ --privileged \ @@ -69,9 +71,10 @@ docker pull intelanalytics/ipex-llm-inference-cpp-xpu:latest -e DEVICE=Arc \ --shm-size="16g" \ $DOCKER_IMAGE + ``` +
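Whichever command you used, you can quickly confirm that the container came up before moving on. The name below matches the `CONTAINER_NAME` exported above; adjust it if you changed that value:

```bash
# The container should be listed with an "Up ..." status
docker ps --filter "name=ipex-llm-inference-cpp-xpu-container"

# If it is not listed, inspect the startup logs
docker logs ipex-llm-inference-cpp-xpu-container
```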
-``` - +--- After the container is booted, you could get into the container through `docker exec`. @@ -126,7 +129,7 @@ llama_print_timings: eval time = xxx ms / 31 runs ( xxx ms per llama_print_timings: total time = xxx ms / xxx tokens ``` -Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html) for more details. +Please refer to this [documentation](../Quickstart/llama_cpp_quickstart.md) for more details. ### Running Ollama serving with IPEX-LLM on Intel GPU @@ -194,13 +197,13 @@ Sample output: ``` -Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#pull-model) for more details. +Please refer to this [documentation](../Quickstart/ollama_quickstart.md#4-pull-model) for more details. ### Running Open WebUI with Intel GPU Start the ollama and load the model first, then use the open-webui to chat. -If you have difficulty accessing the huggingface repositories, you may use a mirror, e.g. add export HF_ENDPOINT=https://hf-mirror.com before running bash start.sh. +If you have difficulty accessing the huggingface repositories, you may use a mirror, e.g. add `export HF_ENDPOINT=https://hf-mirror.com`before running bash start.sh. ```bash cd /llm/scripts/ bash start-open-webui.sh @@ -218,4 +221,4 @@ INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit) -For how to log-in or other guide, Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/open_webui_with_ollama_quickstart.html) for more details. +For how to log-in or other guide, Please refer to this [documentation](../Quickstart/open_webui_with_ollama_quickstart.md) for more details. diff --git a/docs/mddocs/DockerGuides/docker_pytorch_inference_gpu.md b/docs/mddocs/DockerGuides/docker_pytorch_inference_gpu.md index 76409384..a4199b54 100644 --- a/docs/mddocs/DockerGuides/docker_pytorch_inference_gpu.md +++ b/docs/mddocs/DockerGuides/docker_pytorch_inference_gpu.md @@ -2,16 +2,12 @@ We can run PyTorch Inference Benchmark, Chat Service and PyTorch Examples on Intel GPUs within Docker (on Linux or WSL). -```eval_rst -.. note:: - - The current Windows + WSL + Docker solution only supports Arc series dGPU. For Windows users with MTL iGPU, it is recommended to install directly via pip install in Miniforge Prompt. Refer to `this guide `_. - -``` +> [!NOTE] +> The current Windows + WSL + Docker solution only supports Arc series dGPU. For Windows users with MTL iGPU, it is recommended to install directly via pip install in Miniforge Prompt. Refer to [this guide](../Quickstart/install_windows_gpu.md). ## Install Docker -Follow the [Docker installation Guide](./docker_windows_gpu.html#install-docker) to install docker on either Linux or Windows. +Follow the [Docker installation Guide](./docker_windows_gpu.md#install-docker) to install docker on either Linux or Windows. ## Launch Docker @@ -20,19 +16,17 @@ Prepare ipex-llm-xpu Docker Image: docker pull intelanalytics/ipex-llm-xpu:latest ``` -Start ipex-llm-xpu Docker Container: +Start ipex-llm-xpu Docker Container. Choose one of the following commands to start the container: -```eval_rst -.. tabs:: - .. tab:: Linux +
+For Linux: - .. code-block:: bash + ```bash + export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest + export CONTAINER_NAME=my_container + export MODEL_PATH=/llm/models[change to your model path] - export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest - export CONTAINER_NAME=my_container - export MODEL_PATH=/llm/models[change to your model path] - - docker run -itd \ + docker run -itd \ --net=host \ --device=/dev/dri \ --memory="32G" \ @@ -40,17 +34,19 @@ Start ipex-llm-xpu Docker Container: --shm-size="16g" \ -v $MODEL_PATH:/llm/models \ $DOCKER_IMAGE + ``` +
- .. tab:: Windows WSL +
+For Windows WSL: - .. code-block:: bash + ```bash + #/bin/bash + export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest + export CONTAINER_NAME=my_container + export MODEL_PATH=/llm/models[change to your model path] - #/bin/bash - export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest - export CONTAINER_NAME=my_container - export MODEL_PATH=/llm/models[change to your model path] - - sudo docker run -itd \ + sudo docker run -itd \ --net=host \ --privileged \ --device /dev/dri \ @@ -60,8 +56,10 @@ Start ipex-llm-xpu Docker Container: -v $MODEL_PATH:/llm/llm-models \ -v /usr/lib/wsl:/usr/lib/wsl \ $DOCKER_IMAGE -``` + ``` +
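Once the container is up, you may want to check that your model directory is mounted where you expect. Note that the Linux command above mounts it at `/llm/models`, while the WSL variant mounts it at `/llm/llm-models`, so adjust the path below accordingly:

```bash
# Replace the path with the mount point used in your docker run command
docker exec -it my_container ls /llm/models
```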
+--- Access the container: ``` @@ -77,18 +75,13 @@ root@arda-arc12:/# sycl-ls [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241] ``` -```eval_rst -.. tip:: - - You can run the Env-Check script to verify your ipex-llm installation and runtime environment. - - .. code-block:: bash - - cd /ipex-llm/python/llm/scripts - bash env-check.sh - - -``` +> [!TIP] +> You can run the Env-Check script to verify your ipex-llm installation and runtime environment. +> +> ```bash +> cd /ipex-llm/python/llm/scripts +> bash env-check.sh +> ``` ## Run Inference Benchmark diff --git a/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md b/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md index 9a07609d..8652d396 100644 --- a/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md +++ b/docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md @@ -4,21 +4,18 @@ An IPEX-LLM container is a pre-configured environment that includes all necessar This guide provides steps to run/develop PyTorch examples in VSCode with Docker on Intel GPUs. -```eval_rst -.. note:: - This guide assumes you have already installed VSCode in your environment. - - To run/develop on Windows, install VSCode and then follow the steps below. - - To run/develop on Linux, you might open VSCode first and SSH to a remote Linux machine, then proceed with the following steps. - -``` +> [!note] +> This guide assumes you have already installed VSCode in your environment. +> +> To run/develop on Windows, install VSCode and then follow the steps below. +> +> To run/develop on Linux, you might open VSCode first and SSH to a remote Linux machine, then proceed with the following steps. ## Install Docker -Follow the [Docker installation Guide](./docker_windows_gpu.html#install-docker) to install docker on either Linux or Windows. +Follow the [Docker installation Guide](./docker_windows_gpu.md#install-docker) to install docker on either Linux or Windows. ## Install Extensions for VSCcode @@ -52,19 +49,18 @@ Open the Terminal in VSCode (you can use the shortcut `` Ctrl+Shift+` ``), then docker pull intelanalytics/ipex-llm-xpu:latest ``` -Start ipex-llm-xpu Docker Container: +Start ipex-llm-xpu Docker Container. Choose one of the following commands to start the container: -```eval_rst -.. tabs:: - .. tab:: Linux +
+For Linux: - .. code-block:: bash + ```bash - export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest - export CONTAINER_NAME=my_container - export MODEL_PATH=/llm/models[change to your model path] + export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest + export CONTAINER_NAME=my_container + export MODEL_PATH=/llm/models[change to your model path] - docker run -itd \ + docker run -itd \ --net=host \ --device=/dev/dri \ --memory="32G" \ @@ -72,17 +68,19 @@ Start ipex-llm-xpu Docker Container: --shm-size="16g" \ -v $MODEL_PATH:/llm/models \ $DOCKER_IMAGE + ``` +
- .. tab:: Windows WSL +
+For Windows WSL: - .. code-block:: bash + ```bash + #/bin/bash + export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest + export CONTAINER_NAME=my_container + export MODEL_PATH=/llm/models[change to your model path] - #/bin/bash - export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest - export CONTAINER_NAME=my_container - export MODEL_PATH=/llm/models[change to your model path] - - sudo docker run -itd \ + sudo docker run -itd \ --net=host \ --privileged \ --device /dev/dri \ @@ -92,8 +90,10 @@ Start ipex-llm-xpu Docker Container: -v $MODEL_PATH:/llm/llm-models \ -v /usr/lib/wsl:/usr/lib/wsl \ $DOCKER_IMAGE -``` + ``` +
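Before attaching VSCode to it, you can double-check from the VSCode terminal that the container is actually running. The name matches the `CONTAINER_NAME` exported above:

```bash
# Prints "running" once the container has started successfully
docker container inspect -f '{{.State.Status}}' my_container
```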
+--- ## Run/Develop Pytorch Examples diff --git a/docs/mddocs/DockerGuides/docker_windows_gpu.md b/docs/mddocs/DockerGuides/docker_windows_gpu.md index ce536f9b..0fd9a965 100644 --- a/docs/mddocs/DockerGuides/docker_windows_gpu.md +++ b/docs/mddocs/DockerGuides/docker_windows_gpu.md @@ -14,18 +14,12 @@ Follow the instructions in the [Offcial Docker Guide](https://www.docker.com/get ### Windows -```eval_rst -.. tip:: +> [!TIP] +> The installation requires at least 35GB of free disk space on C drive. - The installation requires at least 35GB of free disk space on C drive. +> [!NOTE] +> Detailed installation instructions for Windows, including steps for enabling WSL2, can be found on the [Docker Desktop for Windows installation page](https://docs.docker.com/desktop/install/windows-install/). -``` -```eval_rst -.. note:: - - Detailed installation instructions for Windows, including steps for enabling WSL2, can be found on the [Docker Desktop for Windows installation page](https://docs.docker.com/desktop/install/windows-install/). - -``` #### Install Docker Desktop for Windows Follow the instructions in [this guide](https://docs.docker.com/desktop/install/windows-install/) to install **Docker Desktop for Windows**. Restart you machine after the installation is complete. @@ -34,11 +28,9 @@ Follow the instructions in [this guide](https://docs.docker.com/desktop/install/ Follow the instructions in [this guide](https://docs.microsoft.com/en-us/windows/wsl/install) to install **Windows Subsystem for Linux 2 (WSL2)**. -```eval_rst -.. tip:: +> [!TIP] +> You may verify WSL2 installation by running the command `wsl --list` in PowerShell or Command Prompt. If WSL2 is installed, you will see a list of installed Linux distributions. - You may verify WSL2 installation by running the command `wsl --list` in PowerShell or Command Prompt. If WSL2 is installed, you will see a list of installed Linux distributions. -``` #### Enable Docker integration with WSL2 @@ -47,11 +39,10 @@ Open **Docker desktop**, and select `Settings`->`Resources`->`WSL integration`-> -```eval_rst -.. tip:: - If you encounter **Docker Engine stopped** when opening Docker Desktop, you can reopen it in administrator mode. -``` +> [!TIP] +> If you encounter **Docker Engine stopped** when opening Docker Desktop, you can reopen it in administrator mode. + #### Verify Docker is enabled in WSL2 @@ -67,11 +58,9 @@ You can see the output similar to the following: -```eval_rst -.. tip:: - During the use of Docker in WSL, Docker Desktop needs to be kept open all the time. -``` +> [!TIP] +> During the use of Docker in WSL, Docker Desktop needs to be kept open all the time. ## IPEX-LLM Docker Containers @@ -89,7 +78,7 @@ We have several docker images available for running LLMs on Intel GPUs. The foll | intelanalytics/ipex-llm-finetune-qlora-xpu:latest| GPU Finetuning|For fine-tuning LLMs using QLora/Lora, etc.| We have also provided several quickstarts for various usage scenarios: -- [Run and develop LLM applications in PyTorch](./docker_pytorch_inference_gpu.html) +- [Run and develop LLM applications in PyTorch](./docker_pytorch_inference_gpu.md) ... to be added soon. 
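Any of the images in the table above can be pulled ahead of time with `docker pull`; for example (the `latest` tag is assumed here, substitute a pinned version tag if you prefer):

```bash
# Pull the GPU serving image listed in the table above
docker pull intelanalytics/ipex-llm-serving-xpu:latest
```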
diff --git a/docs/mddocs/DockerGuides/fastchat_docker_quickstart.md b/docs/mddocs/DockerGuides/fastchat_docker_quickstart.md index 786316fd..f2a684e9 100644 --- a/docs/mddocs/DockerGuides/fastchat_docker_quickstart.md +++ b/docs/mddocs/DockerGuides/fastchat_docker_quickstart.md @@ -4,7 +4,7 @@ This guide demonstrates how to run `FastChat` serving with `IPEX-LLM` on Intel G ## Install docker -Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux. +Follow the instructions in this [guide](./docker_windows_gpu.md#linux) to install Docker on Linux. ## Pull the latest image @@ -17,7 +17,7 @@ docker pull intelanalytics/ipex-llm-serving-xpu:latest To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Change the `/path/to/models` to mount the models. -``` +```bash #/bin/bash export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest export CONTAINER_NAME=ipex-llm-serving-xpu-container @@ -54,9 +54,9 @@ root@arda-arc12:/# sycl-ls For convenience, we have provided a script named `/llm/start-fastchat-service.sh` for you to start the service. -However, the script only provide instructions for the most common scenarios. If this script doesn't meet your needs, you can always find the complete guidance for FastChat at [Serving using IPEX-LLM and FastChat](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#start-the-service). +However, the script only provide instructions for the most common scenarios. If this script doesn't meet your needs, you can always find the complete guidance for FastChat at [Serving using IPEX-LLM and FastChat](../Quickstart/fastchat_quickstart.md#2-start-the-service). -Before starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#runtime-configurations) to setup our recommended runtime configurations. +Before starting the service, you can refer to this [section](../Quickstart/install_linux_gpu.md#runtime-configurations) to setup our recommended runtime configurations. Now we can start the FastChat service, you can use our provided script `/llm/start-fastchat-service.sh` like the following way: @@ -105,10 +105,10 @@ The `vllm_worker` may start slowly than normal `ipex_llm_worker`. The booted se -```eval_rst -.. note:: - To verify/use the service booted by the script, follow the instructions in `this guide `_. -``` + +> [!note] +> To verify/use the service booted by the script, follow the instructions in [this guide](../Quickstart/fastchat_quickstart.md#launch-restful-api-server). 
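As a quick smoke test once the service is up, you can query the OpenAI-compatible endpoint directly. The port and payload below are assumptions based on FastChat's usual defaults rather than values taken from the script, so adjust them to whatever `start-fastchat-service.sh` actually uses on your system:

```bash
# List the models currently registered with the controller
curl http://localhost:8000/v1/models

# Send a minimal chat completion request (replace MODEL_NAME with one of the models listed above)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MODEL_NAME", "messages": [{"role": "user", "content": "Hello"}]}'
```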
+ After a request has been sent to the `openai_api_server`, the corresponding inference time latency can be found in the worker log as shown below: diff --git a/docs/mddocs/DockerGuides/index.md b/docs/mddocs/DockerGuides/index.md new file mode 100644 index 00000000..79939382 --- /dev/null +++ b/docs/mddocs/DockerGuides/index.md @@ -0,0 +1,16 @@ +# IPEX-LLM Docker Container User Guides + + +In this section, you will find guides related to using IPEX-LLM with Docker, covering how to: + +- [Overview of IPEX-LLM Containers](./docker_windows_gpu.md) + +- Inference in Python/C++ + - [GPU Inference in Python with IPEX-LLM](./docker_pytorch_inference_gpu.md) + - [VSCode LLM Development with IPEX-LLM on Intel GPU](./docker_run_pytorch_inference_in_vscode.md) + - [llama.cpp/Ollama/Open-WebUI with IPEX-LLM on Intel GPU](./docker_cpp_xpu_quickstart.md) + +- Serving + - [FastChat with IPEX-LLM on Intel GPU](./fastchat_docker_quickstart.md) + - [vLLM with IPEX-LLM on Intel GPU](./vllm_docker_quickstart.md) + - [vLLM with IPEX-LLM on Intel CPU](./vllm_cpu_docker_quickstart.md) diff --git a/docs/mddocs/DockerGuides/index.rst b/docs/mddocs/DockerGuides/index.rst deleted file mode 100644 index 29781e52..00000000 --- a/docs/mddocs/DockerGuides/index.rst +++ /dev/null @@ -1,15 +0,0 @@ -IPEX-LLM Docker Container User Guides -===================================== - -In this section, you will find guides related to using IPEX-LLM with Docker, covering how to: - -* `Overview of IPEX-LLM Containers <./docker_windows_gpu.html>`_ - -* Inference in Python/C++ - * `GPU Inference in Python with IPEX-LLM <./docker_pytorch_inference_gpu.html>`_ - * `VSCode LLM Development with IPEX-LLM on Intel GPU <./docker_pytorch_inference_gpu.html>`_ - * `llama.cpp/Ollama/Open-WebUI with IPEX-LLM on Intel GPU <./docker_cpp_xpu_quickstart.html>`_ -* Serving - * `FastChat with IPEX-LLM on Intel GPU <./fastchat_docker_quickstart.html>`_ - * `vLLM with IPEX-LLM on Intel GPU <./vllm_docker_quickstart.html>`_ - * `vLLM with IPEX-LLM on Intel CPU <./vllm_cpu_docker_quickstart.html>`_ diff --git a/docs/mddocs/DockerGuides/vllm_cpu_docker_quickstart.md b/docs/mddocs/DockerGuides/vllm_cpu_docker_quickstart.md index 36b39ed5..231e7bfa 100644 --- a/docs/mddocs/DockerGuides/vllm_cpu_docker_quickstart.md +++ b/docs/mddocs/DockerGuides/vllm_cpu_docker_quickstart.md @@ -18,7 +18,7 @@ docker pull intelanalytics/ipex-llm-serving-cpu:latest ## Start Docker Container To fully use your Intel CPU to run vLLM inference and serving, you should -``` +```bash #/bin/bash export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:latest export CONTAINER_NAME=ipex-llm-serving-cpu-container @@ -48,7 +48,7 @@ We have included multiple vLLM-related files in `/llm/`: 3. `payload-1024.lua`: Used for testing request per second using 1k-128 request 4. `start-vllm-service.sh`: Used for template for starting vLLM service -Before performing benchmark or starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_cpu.html#environment-setup) to setup our recommended runtime configurations. +Before performing benchmark or starting the service, you can refer to this [section](../Overview/install_cpu.md#environment-setup) to setup our recommended runtime configurations. 
### Service @@ -92,7 +92,7 @@ You can tune the service using these four arguments: - `--max-num-batched-token` - `--max-num-seq` -You can refer to this [doc](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#service) for a detailed explaination on these parameters. +You can refer to this [doc](../Quickstart/vLLM_quickstart.md#service) for a detailed explaination on these parameters. ### Benchmark @@ -115,4 +115,4 @@ wrk -t8 -c8 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --tim #### Offline benchmark through benchmark_vllm_throughput.py -Please refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking. +Please refer to this [section](../Quickstart/vLLM_quickstart.md#5performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking. diff --git a/docs/mddocs/DockerGuides/vllm_docker_quickstart.md b/docs/mddocs/DockerGuides/vllm_docker_quickstart.md index eb7fff3e..27668e76 100644 --- a/docs/mddocs/DockerGuides/vllm_docker_quickstart.md +++ b/docs/mddocs/DockerGuides/vllm_docker_quickstart.md @@ -4,7 +4,7 @@ This guide demonstrates how to run `vLLM` serving with `IPEX-LLM` on Intel GPUs ## Install docker -Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux. +Follow the instructions in this [guide](./docker_windows_gpu.md#linux) to install Docker on Linux. ## Pull the latest image @@ -18,7 +18,7 @@ docker pull intelanalytics/ipex-llm-serving-xpu:latest To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Change the `/path/to/models` to mount the models. -``` +```bash #/bin/bash export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest export CONTAINER_NAME=ipex-llm-serving-xpu-container @@ -58,7 +58,7 @@ We have included multiple vLLM-related files in `/llm/`: 3. `payload-1024.lua`: Used for testing request per second using 1k-128 request 4. `start-vllm-service.sh`: Used for template for starting vLLM service -Before performing benchmark or starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#runtime-configurations) to setup our recommended runtime configurations. +Before performing benchmark or starting the service, you can refer to this [section](../Quickstart/install_linux_gpu.md#runtime-configurations) to setup our recommended runtime configurations. ### Service @@ -82,7 +82,7 @@ If the service have booted successfully, you should see the output similar to th vLLM supports to utilize multiple cards through tensor parallel. -You can refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#about-tensor-parallel) on how to utilize the `tensor-parallel` feature and start the service. +You can refer to this [documentation](../Quickstart/vLLM_quickstart.md#4-about-tensor-parallel) on how to utilize the `tensor-parallel` feature and start the service. #### Verify After the service has been booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL` should be set equal to `served_model_name` in your booting script, e.g. `Qwen1.5`. 
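For example, a completion request of roughly the following shape should work. The endpoint and port are assumptions that follow the `http://localhost:8000/v1/completions` address used by the wrk benchmark in these guides, and the prompt and `max_tokens` values are placeholders to adjust as needed:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "YOUR_MODEL", "prompt": "What is AI?", "max_tokens": 128}'
```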
@@ -113,7 +113,7 @@ You can tune the service using these four arguments:
 - `--max-num-batched-token`
 - `--max-num-seq`
 
-You can refer to this [doc](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#service) for a detailed explaination on these parameters.
+You can refer to this [doc](../Quickstart/vLLM_quickstart.md#service) for a detailed explanation on these parameters.
 
 ### Benchmark
 
@@ -143,4 +143,4 @@ The following figure shows performing benchmark on `Llama-2-7b-chat-hf` using th
 
 #### Offline benchmark through benchmark_vllm_throughput.py
 
-Please refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking.
+Please refer to this [section](../Quickstart/vLLM_quickstart.md#5performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking.
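If you just want to see which options the bundled script accepts before reading that guide, argparse-style help should be available inside the container. This assumes the script sits in `/llm/` alongside the other vLLM-related files listed above:

```bash
# Inside the serving container
cd /llm
python3 benchmark_vllm_throughput.py --help
```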