Update cpp docker quickstart (#11040)

* add sample output

* update link

* update

* update header

* update
This commit is contained in:
Wang, Jian4 2024-05-16 14:55:13 +08:00 committed by GitHub
parent c62e828281
commit 00d4410746
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
6 changed files with 67 additions and 21 deletions

View file

@@ -1,4 +1,4 @@
## Run llama.cpp/Ollama/open-webui with Docker on Intel GPU ## Run llama.cpp/Ollama/Open-WebUI on an Intel GPU via Docker
### Install Docker ### Install Docker
@@ -11,7 +11,7 @@
For Windows installation, refer to this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#install-docker-desktop-for-windows). For Windows installation, refer to this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#install-docker-desktop-for-windows).
#### Setting Docker on windows #### Setting Docker on windows
If you want to run this image on windows, please refer to (this document)[https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/docker_windows_gpu.html#install-docker-on-windows] to set up Docker on windows. Then run below steps on wls ubuntu. And you need to enable `--net=host`,follow [this guide](https://docs.docker.com/network/drivers/host/#docker-desktop) so that you can easily access the service running on the docker. The [v6.1x kernel version wsl]( https://learn.microsoft.com/en-us/community/content/wsl-user-msft-kernel-v6#1---building-the-microsoft-linux-kernel-v61x) is recommended to use.Otherwise, you may encounter the blocking issue before loading the model to GPU. You need to enable `--net=host`; follow [this guide](https://docs.docker.com/network/drivers/host/#docker-desktop) so that you can easily access the service running in the Docker container. The [v6.1x WSL kernel](https://learn.microsoft.com/en-us/community/content/wsl-user-msft-kernel-v6#1---building-the-microsoft-linux-kernel-v61x) is recommended; otherwise, you may encounter a blocking issue before the model is loaded to the GPU.
### Pull the latest image ### Pull the latest image
```bash ```bash

View file

@@ -65,9 +65,6 @@
<a href="doc/LLM/Quickstart/deepspeed_autotp_fastapi_quickstart.html">Run IPEX-LLM serving on Multiple Intel GPUs <a href="doc/LLM/Quickstart/deepspeed_autotp_fastapi_quickstart.html">Run IPEX-LLM serving on Multiple Intel GPUs
using DeepSpeed AutoTP and FastApi</a> using DeepSpeed AutoTP and FastApi</a>
</li> </li>
<li>
<a href="doc/LLM/Quickstart/docker_cpp_xpu_quickstart.html">Run llama.cpp/Ollama/open-webui with Docker on Intel GPU</a>
</li>
</ul> </ul>
</li> </li>
<li> <li>
@@ -83,6 +80,9 @@
<li> <li>
<a href="doc/LLM/DockerGuides/docker_pytorch_inference_gpu.html">Run PyTorch Inference on an Intel GPU via Docker</a> <a href="doc/LLM/DockerGuides/docker_pytorch_inference_gpu.html">Run PyTorch Inference on an Intel GPU via Docker</a>
</li> </li>
<li>
<a href="doc/LLM/DockerGuides/docker_cpp_xpu_quickstart.html">Run llama.cpp/Ollama/Open-WebUI on an Intel GPU via Docker</a>
</li>
</ul> </ul>
</li> </li>
<li> <li>

View file

@@ -21,6 +21,7 @@ subtrees:
- entries: - entries:
- file: doc/LLM/DockerGuides/docker_windows_gpu - file: doc/LLM/DockerGuides/docker_windows_gpu
- file: doc/LLM/DockerGuides/docker_pytorch_inference_gpu - file: doc/LLM/DockerGuides/docker_pytorch_inference_gpu
- file: doc/LLM/DockerGuides/docker_cpp_xpu_quickstart
- file: doc/LLM/Quickstart/index - file: doc/LLM/Quickstart/index
title: "Quickstart" title: "Quickstart"
subtrees: subtrees:
@@ -41,7 +42,6 @@ subtrees:
- file: doc/LLM/Quickstart/fastchat_quickstart - file: doc/LLM/Quickstart/fastchat_quickstart
- file: doc/LLM/Quickstart/axolotl_quickstart - file: doc/LLM/Quickstart/axolotl_quickstart
- file: doc/LLM/Quickstart/deepspeed_autotp_fastapi_quickstart - file: doc/LLM/Quickstart/deepspeed_autotp_fastapi_quickstart
- file: doc/LLM/Quickstart/docker_cpp_xpu_quickstart
- file: doc/LLM/Overview/KeyFeatures/index - file: doc/LLM/Overview/KeyFeatures/index
title: "Key Features" title: "Key Features"
subtrees: subtrees:

View file

@@ -1,4 +1,4 @@
## Run llama.cpp/Ollama/open-webui with Docker on Intel GPU ## Run llama.cpp/Ollama/Open-WebUI on an Intel GPU via Docker
## Quick Start ## Quick Start
@@ -13,7 +13,8 @@
For Windows installation, refer to this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#install-docker-desktop-for-windows). For Windows installation, refer to this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#install-docker-desktop-for-windows).
#### Setting Docker on windows #### Setting Docker on windows
If you want to run this image on windows, please refer to (this document)[https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/docker_windows_gpu.html#install-docker-on-windows] to set up Docker on windows. And run below steps on wls ubuntu. And you need to enable `--net=host`,follow [this guide](https://docs.docker.com/network/drivers/host/#docker-desktop) so that you can easily access the service running on the docker. The [v6.1x kernel version wsl]( https://learn.microsoft.com/en-us/community/content/wsl-user-msft-kernel-v6#1---building-the-microsoft-linux-kernel-v61x) is recommended to use.Otherwise, you may encounter the blocking issue before loading the model to GPU.
You need to enable `--net=host`; follow [this guide](https://docs.docker.com/network/drivers/host/#docker-desktop) so that you can easily access the service running in the Docker container. The [v6.1x WSL kernel](https://learn.microsoft.com/en-us/community/content/wsl-user-msft-kernel-v6#1---building-the-microsoft-linux-kernel-v61x) is recommended; otherwise, you may encounter a blocking issue before the model is loaded to the GPU.
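To make the role of `--net=host` concrete, the sketch below shows how it fits into the container start command on WSL; the flags come from this guide, while the container name and the mount target for `/path/to/models` are illustrative placeholders rather than values from the original document:

```bash
# A sketch of how --net=host fits into the container start command (flags taken from this
# guide; the container name and the /path/to/models mount target are illustrative placeholders).
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest

docker run -itd \
        --net=host \
        --device=/dev/dri \
        --privileged \
        -v /path/to/models:/models \
        -v /usr/lib/wsl:/usr/lib/wsl \
        --name=ipex-llm-inference-cpp-xpu-container \
        $DOCKER_IMAGE
```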
### Pull the latest image ### Pull the latest image
```bash ```bash
@@ -48,7 +49,7 @@ docker pull intelanalytics/ipex-llm-inference-cpp-xpu:latest
.. tab:: Windows .. tab:: Windows
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. And change the `/path/to/models` to mount the models. Then add `--privileged ` and map the `/usr/lib/wsl` to the docker. To map the `xpu` into the container, specify `--device=/dev/dri` when starting the container, change `/path/to/models` to the directory that holds your models, then add `--privileged` and map `/usr/lib/wsl` into the container.
.. code-block:: bash .. code-block:: bash
@@ -95,15 +96,16 @@ Notice that the performance on windows wsl docker is a little slower than on win
```bash ```bash
bash /llm/scripts/benchmark_llama-cpp.sh bash /llm/scripts/benchmark_llama-cpp.sh
# benchmark results
llama_print_timings: load time = xxx ms
llama_print_timings: sample time = xxx ms / xxx runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: prompt eval time = xxx ms / xxx tokens ( xxx ms per token, xxx tokens per second)
llama_print_timings: eval time = xxx ms / 128 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: total time = xxx ms / xxx tokens
``` ```
The benchmark runs three times to warm up and obtain accurate results; the example output looks like the following:
```bash
llama_print_timings: load time = xxx ms
llama_print_timings: sample time = xxx ms / 128 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: prompt eval time = xxx ms / xxx tokens ( xxx ms per token, xxx tokens per second)
llama_print_timings: eval time = xxx ms / 127 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: total time = xxx ms / xxx tokens
```
### Running llama.cpp inference with IPEX-LLM on Intel GPU ### Running llama.cpp inference with IPEX-LLM on Intel GPU
@@ -115,6 +117,15 @@ source ipex-llm-init --gpu --device $DEVICE
bash start-llama-cpp.sh bash start-llama-cpp.sh
``` ```
The example output looks like the following:
```bash
llama_print_timings: load time = xxx ms
llama_print_timings: sample time = xxx ms / 32 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: prompt eval time = xxx ms / xxx tokens ( xxx ms per token, xxx tokens per second)
llama_print_timings: eval time = xxx ms / 31 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: total time = xxx ms / xxx tokens
```
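The helper script takes care of the details, but if you want to run a model by hand, a plain llama.cpp invocation looks roughly like the sketch below; the working directory, model file, and prompt are assumptions for illustration, not values from this guide:

```bash
# Illustration only: the directory, GGUF file, and prompt below are placeholders.
# -n 32 requests up to 32 new tokens (in line with the sample output above);
# -ngl 999 offloads all layers to the Intel GPU.
cd /llm/llama-cpp
./main -m /models/llama-2-7b-chat.Q4_K_M.gguf -p "What is AI?" -n 32 -ngl 999 --color
```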
Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html) for more details. Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html) for more details.
@@ -125,7 +136,18 @@ Running the ollama on the background, you can see the ollama.log in `/root/ollam
cd /llm/scripts/ cd /llm/scripts/
# set the recommended Env # set the recommended Env
source ipex-llm-init --gpu --device $DEVICE source ipex-llm-init --gpu --device $DEVICE
bash start-ollama.sh # ctrl+c to exit bash start-ollama.sh # ctrl+c to exit; ollama serve will keep running in the background
```
Sample output:
```bash
time=2024-05-16T10:45:33.536+08:00 level=INFO source=images.go:697 msg="total blobs: 0"
time=2024-05-16T10:45:33.536+08:00 level=INFO source=images.go:704 msg="total unused blobs removed: 0"
time=2024-05-16T10:45:33.536+08:00 level=INFO source=routes.go:1044 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-05-16T10:45:33.537+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama751325299/runners
time=2024-05-16T10:45:33.565+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-05-16T10:45:33.565+08:00 level=INFO source=gpu.go:122 msg="Detecting GPUs"
time=2024-05-16T10:45:33.566+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
``` ```
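Once the log reports that the server is listening, you can optionally verify it from another shell inside the container; `/api/tags` is Ollama's standard endpoint for listing locally available models, so a fresh server returns an empty list:

```bash
# Quick sanity check against the default Ollama address shown in the log above.
curl http://127.0.0.1:11434/api/tags
# A fresh server with no models pulled yet typically returns: {"models":[]}
```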
#### Run Ollama models (interactive) #### Run Ollama models (interactive)
@@ -142,6 +164,13 @@ PARAMETER num_predict 64
./ollama run example ./ollama run example
``` ```
An example of interacting with the model via `ollama run example` looks like the following:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" width="100%" />
</a>
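Besides the interactive session shown above, `ollama run` also accepts the prompt as an argument, which is handy for a quick one-shot check (the prompt text here is arbitrary):

```bash
# Non-interactive variant of the interactive demo above; run it from the same directory.
./ollama run example "What is AI? Answer in one sentence."
```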
#### Pull models from ollama to serve #### Pull models from ollama to serve
```bash ```bash
@@ -159,6 +188,12 @@ curl http://localhost:11434/api/generate -d '
}' }'
``` ```
Sample output:
```bash
{"model":"llama2","created_at":"2024-05-16T02:52:18.972296097Z","response":"\nArtificial intelligence (AI) refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making. AI systems use algorithms and data to mimic human behavior and perform tasks such as:\n\n1. Image recognition: AI can identify objects in images and classify them into different categories.\n2. Natural Language Processing (NLP): AI can understand and generate human language, allowing it to interact with humans through voice assistants or chatbots.\n3. Predictive analytics: AI can analyze data to make predictions about future events, such as stock prices or weather patterns.\n4. Robotics: AI can control robots that perform tasks such as assembly, maintenance, and logistics.\n5. Recommendation systems: AI can suggest products or services based on a user's past behavior or preferences.\n6. Autonomous vehicles: AI can control self-driving cars that can navigate through roads and traffic without human intervention.\n7. Fraud detection: AI can identify and flag fraudulent transactions, such as credit card purchases or insurance claims.\n8. Personalized medicine: AI can analyze genetic data to provide personalized medical recommendations, such as drug dosages or treatment plans.\n9. Virtual assistants: AI can interact with users through voice or text interfaces, providing information or completing tasks.\n10. Sentiment analysis: AI can analyze text or speech to determine the sentiment or emotional tone of a message.\n\nThese are just a few examples of what AI can do. As the technology continues to evolve, we can expect to see even more innovative applications of AI in various industries and aspects of our lives.","done":true,"context":[xxx,xxx],"total_duration":12831317190,"load_duration":6453932096,"prompt_eval_count":25,"prompt_eval_duration":254970000,"eval_count":390,"eval_duration":6079077000}
```
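A closely related call (not a required step in this guide) is Ollama's chat endpoint, which takes a message list; setting `"stream": false` returns a single JSON object instead of streamed chunks. This assumes the `llama2` model has already been pulled as shown above:

```bash
# Chat-style request against the same server; assumes llama2 is already pulled.
curl http://localhost:11434/api/chat -d '{
  "model": "llama2",
  "messages": [
    {"role": "user", "content": "What is AI?"}
  ],
  "stream": false
}'
```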
Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#pull-model) for more details. Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#pull-model) for more details.
@@ -169,7 +204,18 @@ If you have difficulty accessing the huggingface repositories, you may use a mir
```bash ```bash
cd /llm/scripts/ cd /llm/scripts/
bash start-open-webui.sh bash start-open-webui.sh
# INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
``` ```
Sample output:
```bash
INFO: Started server process [1055]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
```
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_signup.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_signup.png" width="100%" />
</a>
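With the output above, the UI should be reachable at `http://localhost:8080` from your browser (assuming the container was started with `--net=host` as described earlier); a quick command-line check:

```bash
# Confirms the Open-WebUI port responds before you open it in a browser.
curl -I http://localhost:8080
```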
For how to log-in or other guide, Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/open_webui_with_ollama_quickstart.html) for more details. For how to log-in or other guide, Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/open_webui_with_ollama_quickstart.html) for more details.

View file

@@ -5,4 +5,5 @@ In this section, you will find guides related to using IPEX-LLM with Docker, cov
* `Overview of IPEX-LLM Containers for Intel GPU <./docker_windows_gpu.html>`_ * `Overview of IPEX-LLM Containers for Intel GPU <./docker_windows_gpu.html>`_
* `Run PyTorch Inference on an Intel GPU via Docker <./docker_pytorch_inference_gpu.html>`_ * `Run PyTorch Inference on an Intel GPU via Docker <./docker_pytorch_inference_gpu.html>`_
* `Run llama.cpp/Ollama/open-webui with Docker on Intel GPU <./docker_cpp_xpu_quickstart.html>`_

View file

@@ -25,8 +25,7 @@ This section includes efficient guide to show you how to:
* `Run Llama 3 on Intel GPU using llama.cpp and ollama with IPEX-LLM <./llama3_llamacpp_ollama_quickstart.html>`_ * `Run Llama 3 on Intel GPU using llama.cpp and ollama with IPEX-LLM <./llama3_llamacpp_ollama_quickstart.html>`_
* `Run IPEX-LLM Serving with FastChat <./fastchat_quickstart.html>`_ * `Run IPEX-LLM Serving with FastChat <./fastchat_quickstart.html>`_
* `Finetune LLM with Axolotl on Intel GPU <./axolotl_quickstart.html>`_ * `Finetune LLM with Axolotl on Intel GPU <./axolotl_quickstart.html>`_
* `Run IPEX-LLM serving on Multiple Intel GPUs using DeepSpeed AutoTP and FastApi <./deepspeed_autotp_fastapi_quickstart.html>` * `Run IPEX-LLM serving on Multiple Intel GPUs using DeepSpeed AutoTP and FastApi <./deepspeed_autotp_fastapi_quickstart.html>`_
* `Run llama.cpp/Ollama/open-webui with Docker on Intel GPU <./docker_cpp_xpu_quickstart.html>`
.. |bigdl_llm_migration_guide| replace:: ``bigdl-llm`` Migration Guide .. |bigdl_llm_migration_guide| replace:: ``bigdl-llm`` Migration Guide