refine and verify ipex-inference-cpu docker document (#10565)
* restructure the index
* refine and verify cpu-inference document
* update
This commit is contained in:
parent
0136fad1d4
commit
52f1b541cf
2 changed files with 150 additions and 132 deletions

@@ -1,148 +1,166 @@

# Getting started with IPEX-LLM in Docker

This guide provides step-by-step instructions for installing and using IPEX-LLM in a Docker environment. It covers setups for both CPU and XPU (Intel GPU) on different operating systems.

### Index

- [Docker Installation](#docker-installation-instructions)
- [IPEX-LLM Inference](#ipex-llm-inference-on-cpu)
  - [On CPU](#ipex-llm-inference-on-cpu)
  - [On XPU](#ipex-llm-inference-on-xpu)
- [IPEX-LLM Serving](#ipex-llm-serving-on-cpu)
  - [On CPU](#ipex-llm-serving-on-cpu)
  - [On XPU](#ipex-llm-serving-on-xpu)
- [IPEX-LLM Fine Tuning](#ipex-llm-fine-tuning-on-cpu)
  - [On CPU](#ipex-llm-fine-tuning-on-cpu)
  - [On XPU](#ipex-llm-fine-tuning-on-xpu)

## Docker Installation Instructions

1. **For New Users:**
   - Begin by visiting the [official Docker Get Started page](https://www.docker.com/get-started/) for a comprehensive introduction and installation guide.

2. **Additional Steps for Windows Users:**
   - Follow the step-by-step guide in [Docker Installation Instructions for Windows](https://github.com/intel-analytics/ipex-llm/blob/main/docs/readthedocs/source/doc/LLM/Quickstart/docker_windows_gpu.md), and make sure WSL2 or Hyper-V is enabled on your computer.
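
Before pulling any IPEX-LLM images, it is worth confirming that the Docker engine itself is working. A minimal sanity check from any terminal (standard Docker commands, not specific to IPEX-LLM):

```bash
# Show the client/engine version and confirm the daemon is reachable
docker --version
docker info

# Optional end-to-end test that pulls and runs a tiny test image
docker run --rm hello-world
```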

## IPEX-LLM Inference on CPU

### 1. Prepare ipex-llm-cpu Docker Image

Run the following command to pull the image from Docker Hub:

```bash
docker pull intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT
```
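
To check that the image was downloaded successfully, you can list it afterwards (shown with `grep`; on Windows PowerShell, `Select-String` plays the same role):

```bash
docker images | grep "intelanalytics/ipex-llm-cpu"
```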

### 2. Start ipex-llm-cpu Docker Container

To run the image and do inference, create and run a script like the following example (change the model path and the CPU binding to match your machine):

```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]

docker run -itd \
        --privileged \
        -p 12345:12345 \
        --cpuset-cpus="0-47" \
        --cpuset-mems="0" \
        --name=$CONTAINER_NAME \
        -v $MODEL_PATH:/llm/models \
        $DOCKER_IMAGE
```

Access the container:

```bash
docker exec -it $CONTAINER_NAME bash
```
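
When you are finished with the container, it can be stopped and removed from the host with standard Docker commands, using the `CONTAINER_NAME` set in the script above:

```bash
docker stop $CONTAINER_NAME
docker rm $CONTAINER_NAME
```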

### 3. Start Inference and Tutorials

**3.1 Chat Interface**: Use `chat.py` for conversational AI.

For example, if your model is chatglm2-6b and mounted on /llm/models, you can execute the following command to initiate a conversation:

```bash
cd /llm/portable-zip
python chat.py --model-path /llm/models/chatglm2-6b
```

Here is a demonstration:

<a align="left" href="https://llm-assets.readthedocs.io/en/latest/_images/llm-inference-cpu-docker-chatpy-demo.gif">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/llm-inference-cpu-docker-chatpy-demo.gif" width='60%' />
</a><br>
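
Besides `chat.py`, you can run inference directly with the IPEX-LLM Python API inside the container; the CPU INT4 examples in the [repository documentation](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm#cpu-int4) cover this in detail. Below is a minimal sketch run from the container shell, assuming the image's Python environment provides the `ipex_llm` package; the model path and prompt are placeholders:

```bash
python - <<'EOF'
# Minimal CPU INT4 inference sketch with the ipex-llm transformers-style API.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "/llm/models/chatglm2-6b"  # placeholder: any model mounted under /llm/models

# load_in_4bit=True applies IPEX-LLM's symmetric INT4 optimization for CPU inference
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("What is AI?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
EOF
```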

**3.2 Jupyter Lab Tutorials**: Start a Jupyter Lab session for IPEX-LLM tutorials.

Run the following command to start the notebook server:

```bash
cd /llm
./start-notebook.sh [--port EXPECTED_PORT]
```

The default port is 12345; you can assign a different port by specifying the `--port` parameter.

If you're using the host network mode when booting the container, once the service is running successfully, you can access the tutorial at http://127.0.0.1:12345/lab. Otherwise, you need to ensure the correct ports are bound between the container and the host.

Here's a demonstration of how to navigate the tutorial in the explorer:

<a align="left" href="https://llm-assets.readthedocs.io/en/latest/_images/llm-inference-cpu-docker-tutorial-demo.gif">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/llm-inference-cpu-docker-tutorial-demo.gif" width='60%' />
</a> <br>
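
If you are not sure whether the notebook is reachable from the host, a quick check (assuming the default port 12345) is:

```bash
# Should return an HTTP response header from Jupyter Lab if the service is up
curl -I http://127.0.0.1:12345/lab
```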

**3.3 Performance Benchmark**: We provide a benchmark tool that helps users test all the benchmarks and record the results in a CSV file.

```bash
cd /llm/benchmark/all-in-one
```

Users can provide models and related information in `config.yaml`.

```yaml
repo_id:
  - 'THUDM/chatglm-6b'
  - 'THUDM/chatglm2-6b'
  - 'meta-llama/Llama-2-7b-chat-hf'
  # - 'liuhaotian/llava-v1.5-7b' # requires a LLAVA_REPO_DIR env variable pointing to the llava dir; added only for gpu win related test_api now
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
num_beams: 1 # default to greedy search
low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
batch_size: 1 # default to 1
in_out_pairs:
  - '32-32'
  - '1024-128'
test_api:
  - "transformer_int4"
  - "native_int4"
  - "optimize_model"
  - "pytorch_autocast_bf16"
  # - "transformer_autocast_bf16"
  # - "bigdl_ipex_bf16"
  # - "bigdl_ipex_int4"
  # - "bigdl_ipex_int8"
  # - "ipex_fp16_gpu" # on Intel GPU
  # - "bigdl_fp16_gpu" # on Intel GPU
  # - "transformer_int4_gpu" # on Intel GPU
  # - "optimize_model_gpu" # on Intel GPU
  # - "deepspeed_transformer_int4_cpu" # on Intel SPR Server
  # - "transformer_int4_gpu_win" # on Intel GPU for Windows
  # - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows, use fp16 for non-linear layer
  # - "transformer_int4_loadlowbit_gpu_win" # on Intel GPU for Windows using load_low_bit API. Please make sure you have used the save.py to save the converted low bit model
  # - "deepspeed_optimize_model_gpu" # deepspeed autotp on Intel GPU
  # - "speculative_cpu"
  # - "speculative_gpu"
cpu_embedding: False # whether put embedding to CPU (only available now for gpu win related test_api)
streaming: False # whether output in streaming way (only available now for gpu win related test_api)
```

This benchmark tool offers various test APIs, including `transformer_int4`, `speculative_cpu`, and more.

For instance, if you wish to benchmark **inference with speculative decoding**, enable the `speculative_cpu` test API in the `config.yaml` file.

Then, execute `bash run-spr.sh`, which will generate the output results in `results.csv`.

```bash
bash run-spr.sh
```

For further details and comprehensive functionality of the benchmark tool, please refer to the [all-in-one benchmark tool](https://github.com/intel-analytics/BigDL/tree/main/python/llm/dev/benchmark/all-in-one).

Additionally, for examples related to Inference with Speculative Decoding, you can explore [Speculative-Decoding Examples](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/Speculative-Decoding).
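
After `run-spr.sh` finishes, you can take a quick look at the recorded metrics directly in the shell (`results.csv` is written in the same `all-in-one` directory; column names may vary by version):

```bash
# Render the CSV as an aligned table for a quick scan
column -s, -t < results.csv | less -S
```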

## IPEX-LLM Inference on XPU

First, pull the docker image from Docker Hub:

```bash
docker pull intelanalytics/ipex-llm-xpu:2.1.0-SNAPSHOT
```

To map the xpu into the container, you need to specify `--device=/dev/dri` when booting the container.

An example could be:

```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:2.1.0-SNAPSHOT
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
```

@@ -170,18 +188,18 @@ root@arda-arc12:/# sycl-ls

To run inference with `IPEX-LLM` on XPU, you could refer to this [documentation](https://github.com/intel-analytics/IPEX/tree/main/python/llm/example/GPU).
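
The elided part of this section shows `sycl-ls` being run inside the container; you can use the same check yourself to confirm the GPU has been mapped in before starting any workload (device names will differ by machine):

```bash
# List SYCL devices; Intel GPU entries (level_zero / opencl) should appear
sycl-ls
```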

## IPEX-LLM Serving on CPU

### Boot container

Pull image:

```bash
docker pull intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT
```

You could use the following bash script to start the container. Please note that the CPU config is specified for Xeon CPUs; change it accordingly if you are not using a Xeon CPU.

```bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
```

@@ -259,13 +277,13 @@ python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
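
The elided part of this section launches a FastChat OpenAI-compatible API server on port 8000. Once it is running, you can exercise it from the host with a standard OpenAI-style request; the model name and prompt below are placeholders:

```bash
# Query the OpenAI-compatible endpoint exposed by fastchat.serve.openai_api_server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "chatglm2-6b", "messages": [{"role": "user", "content": "Hello!"}]}'
```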

## IPEX-LLM Serving on XPU

### Boot container

Pull image:

```bash
docker pull intelanalytics/ipex-llm-serving-xpu:2.1.0-SNAPSHOT
```

To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container.

An example could be:

```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:2.1.0-SNAPSHOT
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
export SERVICE_MODEL_PATH=/llm/models/chatglm2-6b[a specified model path for running service]
```

@@ -352,7 +370,7 @@ Finally, launch the RESTful API server

```bash
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
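
As with the CPU serving setup, once the RESTful API server above is running you can list the registered models from the host to confirm the service is reachable (a standard OpenAI-style endpoint served by FastChat):

```bash
curl http://localhost:8000/v1/models
```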

## IPEX-LLM Fine Tuning on CPU

### 1. Prepare Docker Image

@@ -360,10 +378,10 @@ You can download directly from Dockerhub like:

```bash
# For standalone
docker pull intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.1.0-SNAPSHOT

# For k8s
docker pull intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:2.1.0-SNAPSHOT
```

Or build the image from source:

@@ -376,7 +394,7 @@ export HTTPS_PROXY=your_https_proxy

```bash
docker build \
  --build-arg http_proxy=${HTTP_PROXY} \
  --build-arg https_proxy=${HTTPS_PROXY} \
  -t intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.1.0-SNAPSHOT \
  -f ./Dockerfile .

# For k8s
```

@@ -386,7 +404,7 @@ export HTTPS_PROXY=your_https_proxy

```bash
docker build \
  --build-arg http_proxy=${HTTP_PROXY} \
  --build-arg https_proxy=${HTTPS_PROXY} \
  -t intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:2.1.0-SNAPSHOT \
  -f ./Dockerfile.k8s .
```

@@ -407,7 +425,7 @@ docker run -itd \

```bash
  -e https_proxy=${HTTPS_PROXY} \
  -v $BASE_MODE_PATH:/ipex_llm/model \
  -v $DATA_PATH:/ipex_llm/data/alpaca-cleaned \
  intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.1.0-SNAPSHOT
```

The download and mount of base model and data to a docker container demonstrates a standard fine-tuning process. You can skip this step for a quick start; in that case, the fine-tuning code will automatically download the needed files:

@@ -421,7 +439,7 @@ docker run -itd \

```bash
  --name=ipex-llm-fintune-qlora-cpu \
  -e http_proxy=${HTTP_PROXY} \
  -e https_proxy=${HTTPS_PROXY} \
  intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.1.0-SNAPSHOT
```

However, we do recommend that you handle them manually, because the automatic download can be blocked by Internet access or Hugging Face authentication depending on your environment, and the manual method allows you to fine-tune in a custom way (with a different base model and dataset).
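
One way to prepare the base model and dataset on the host before mounting them is `huggingface-cli` from the `huggingface_hub` package. The repository ids below are assumptions for illustration: substitute the base model you actually want to fine-tune, and `yahma/alpaca-cleaned` is taken to be the cleaned Alpaca dataset referred to above:

```bash
pip install -U huggingface_hub

# Example base model (gated on Hugging Face, so authentication may be required)
huggingface-cli download meta-llama/Llama-2-7b-hf --local-dir $BASE_MODE_PATH

# Assumed Hugging Face id for the alpaca-cleaned dataset
huggingface-cli download --repo-type dataset yahma/alpaca-cleaned --local-dir $DATA_PATH
```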

@@ -470,7 +488,7 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --

Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
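
For example, a quick check that the merged checkpoint loads like any other Hugging Face model (plain `transformers` here, no IPEX-LLM-specific API assumed):

```bash
python - <<'EOF'
# Load the merged checkpoint as a regular Hugging Face model and run one short generation
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "./outputs/checkpoint-200-merged"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

inputs = tokenizer("The capital of France is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
EOF
```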

## IPEX-LLM Fine Tuning on XPU

The following shows how to fine-tune an LLM with quantization (QLoRA built on IPEX-LLM 4-bit optimizations) in a docker environment, which is accelerated by Intel XPU.

@@ -479,7 +497,7 @@ The following shows how to fine-tune LLM with Quantization (QLoRA built on IPEX-

You can download directly from Dockerhub like:

```bash
docker pull intelanalytics/ipex-llm-finetune-qlora-xpu:2.1.0-SNAPSHOT
```

Or build the image from source:

@@ -491,7 +509,7 @@ export HTTPS_PROXY=your_https_proxy

```bash
docker build \
  --build-arg http_proxy=${HTTP_PROXY} \
  --build-arg https_proxy=${HTTPS_PROXY} \
  -t intelanalytics/ipex-llm-finetune-qlora-xpu:2.1.0-SNAPSHOT \
  -f ./Dockerfile .
```

@@ -515,7 +533,7 @@ docker run -itd \

```bash
  -v $BASE_MODE_PATH:/model \
  -v $DATA_PATH:/data/alpaca-cleaned \
  --shm-size="16g" \
  intelanalytics/ipex-llm-finetune-qlora-xpu:2.1.0-SNAPSHOT
```

The download and mount of base model and data to a docker container demonstrates a standard fine-tuning process. You can skip this step for a quick start; in that case, the fine-tuning code will automatically download the needed files:

@@ -532,7 +550,7 @@ docker run -itd \

```bash
  -e http_proxy=${HTTP_PROXY} \
  -e https_proxy=${HTTPS_PROXY} \
  --shm-size="16g" \
  intelanalytics/ipex-llm-finetune-qlora-xpu:2.1.0-SNAPSHOT
```

However, we do recommend that you handle them manually, because the automatic download can be blocked by Internet access or Hugging Face authentication depending on your environment, and the manual method allows you to fine-tune in a custom way (with a different base model and dataset).
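
Inside the XPU fine-tuning container, a quick way to confirm the Intel GPU is usable from PyTorch before launching a run is the check below. This assumes the image ships Intel Extension for PyTorch with XPU support, which provides the `torch.xpu` device:

```bash
python - <<'EOF'
# Verify that an XPU device is visible to PyTorch (assumes an IPEX XPU build in the image)
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 - registers the 'xpu' device

print("XPU available:", torch.xpu.is_available())
print("XPU device count:", torch.xpu.device_count())
EOF
```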

---

@@ -6,7 +6,7 @@ docker build \

```bash
  --build-arg http_proxy=.. \
  --build-arg https_proxy=.. \
  --build-arg no_proxy=.. \
  --rm --no-cache -t intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT .
```
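
After the build finishes, the freshly tagged image should show up locally:

```bash
docker images intelanalytics/ipex-llm-cpu
```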

@@ -16,7 +16,7 @@ docker build \

An example could be:

```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT

sudo docker run -itd \
        --net=host \
```

@@ -41,7 +41,7 @@ You can download models and bind the model directory from host machine to contai

Here is an example:

```bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT
export MODEL_PATH=/home/llm/models

sudo docker run -itd \
```