Python Inference using IPEX-LLM on Intel GPU

We can run PyTorch Inference Benchmark, Chat Service and PyTorch Examples on Intel GPUs within Docker (on Linux or WSL).

Note

The current Windows + WSL + Docker solution only supports Arc series dGPUs. Windows users with an MTL iGPU should instead install directly via pip install in the Miniforge Prompt; refer to this guide.

Install Docker

Follow the Docker installation Guide to install docker on either Linux or Windows.

Launch Docker

Prepare ipex-llm-xpu Docker Image:

docker pull intelanalytics/ipex-llm-xpu:latest

Start ipex-llm-xpu Docker Container. Choose one of the following commands to start the container:

  • For Linux users:

    export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
    export CONTAINER_NAME=my_container
    export MODEL_PATH=/llm/models[change to your model path]
    
    docker run -itd \
              --net=host \
              --device=/dev/dri \
              --memory="32G" \
              --name=$CONTAINER_NAME \
              --shm-size="16g" \
              -v $MODEL_PATH:/llm/models \
              $DOCKER_IMAGE
    
  • For Windows WSL users:

    #!/bin/bash
    export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
    export CONTAINER_NAME=my_container
    export MODEL_PATH=/llm/models[change to your model path]
    
    sudo docker run -itd \
                  --net=host \
                  --privileged \
                  --device /dev/dri \
                  --memory="32G" \
                  --name=$CONTAINER_NAME \
                  --shm-size="16g" \
                  -v $MODEL_PATH:/llm/models \
                  -v /usr/lib/wsl:/usr/lib/wsl \
                  $DOCKER_IMAGE
    

Access the container:

docker exec -it $CONTAINER_NAME bash

To verify that the device is successfully mapped into the container, run sycl-ls to check the result. On a machine with an Arc A770, a sample output is:

root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
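If you want to check the mapping programmatically rather than by eye, the sycl-ls output above can be filtered for GPU entries. The snippet below is a minimal sketch that assumes the line format shown in the sample output (`[backend:type:index] Platform, Device ...`):

```python
import re

# Sample sycl-ls output as captured in the container above
# (assumed line format: "[backend:type:index] Platform, Device ...").
sycl_ls_output = """\
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
"""

def gpu_devices(output: str) -> list[str]:
    """Return the device description for every GPU entry in sycl-ls output."""
    devices = []
    for line in output.splitlines():
        m = re.match(r"\[([^:]+):gpu:\d+\]\s*(.+)", line)
        if m:
            devices.append(m.group(2))
    return devices

# A correctly mapped Arc GPU shows up twice: once via OpenCL, once via Level-Zero.
print(gpu_devices(sycl_ls_output))
```

If the Level-Zero GPU entry is missing, the `--device=/dev/dri` mapping (or, on WSL, the `/usr/lib/wsl` mount) is usually the first thing to re-check.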

Tip

You can run the Env-Check script to verify your ipex-llm installation and runtime environment.

cd /ipex-llm/python/llm/scripts
bash env-check.sh

Run Inference Benchmark

Navigate to the benchmark directory and modify config.yaml in the all-in-one folder to set the benchmark configurations.

cd /benchmark/all-in-one
vim config.yaml

In the config.yaml, change repo_id to the model you want to test and local_model_hub to point to your model hub path.

...
repo_id:
  - 'meta-llama/Llama-2-7b-chat-hf'
local_model_hub: '/path/to/your/model/folder'
...
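If you script your benchmark runs, the two fields above can be patched without editing the file by hand. The sketch below assumes the config contains only the lines shown; the real config.yaml has more options, which a string-level edit like this leaves untouched:

```python
import re

# The fragment of config.yaml shown above; the real file has more fields.
config_text = """\
repo_id:
  - 'meta-llama/Llama-2-7b-chat-hf'
local_model_hub: '/path/to/your/model/folder'
"""

def set_model_hub(text: str, hub_path: str) -> str:
    """Point local_model_hub at the mounted model directory."""
    return re.sub(r"local_model_hub: .*", f"local_model_hub: '{hub_path}'", text)

# Point the hub at the path mounted into the container.
patched = set_model_hub(config_text, "/llm/models")
print(patched)
```

A YAML parser would be more robust for heavier edits, but a targeted substitution keeps the rest of the file byte-for-byte identical.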

After modifying config.yaml, run the following commands to start benchmarking:

source ipex-llm-init --gpu --device <value>
python run.py

Result Interpretation

After the benchmarking completes, you can find a CSV result file in the current folder. Focus mainly on the columns 1st token avg latency (ms) and 2+ avg latency (ms/token) for the benchmark results. Also check whether the column actual input/output tokens is consistent with the column input/output tokens, and whether the parameters you specified in config.yaml were successfully applied in the benchmarking.

Run Chat Service

We provide chat.py for conversational AI.

For example, if your model is Llama-2-7b-chat-hf and mounted on /llm/models, you can execute the following command to initiate a conversation:

cd /llm
python chat.py --model-path /llm/models/Llama-2-7b-chat-hf

Here is a demonstration:


Run PyTorch Examples

We provide several PyTorch examples that apply IPEX-LLM INT4 optimizations to models on Intel GPUs.

For example, if your model is Llama-2-7b-chat-hf and mounted on /llm/models, you can navigate to the /examples/llama2 directory and execute the following command to run the example:

cd /examples/<model_dir>
python ./generate.py --repo-id-or-model-path /llm/models/Llama-2-7b-chat-hf --prompt PROMPT --n-predict N_PREDICT

Arguments info:

  • --repo-id-or-model-path REPO_ID_OR_MODEL_PATH: argument defining the Hugging Face repo id for the Llama2 model (e.g. meta-llama/Llama-2-7b-chat-hf or meta-llama/Llama-2-13b-chat-hf) to be downloaded, or the path to the Hugging Face checkpoint folder. The default is 'meta-llama/Llama-2-7b-chat-hf'.
  • --prompt PROMPT: argument defining the prompt to run inference on (with the integrated prompt format for chat). The default is 'What is AI?'.
  • --n-predict N_PREDICT: argument defining the max number of tokens to predict. The default is 32.
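The three arguments above can be sketched as an argparse parser. This is an illustration of the documented interface, not the actual parser in generate.py, which may define additional options:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the documented generate.py flags and their defaults."""
    parser = argparse.ArgumentParser(description="Llama2 generate example flags")
    parser.add_argument("--repo-id-or-model-path", type=str,
                        default="meta-llama/Llama-2-7b-chat-hf",
                        help="Hugging Face repo id or local checkpoint folder")
    parser.add_argument("--prompt", type=str, default="What is AI?",
                        help="prompt to run inference on")
    parser.add_argument("--n-predict", type=int, default=32,
                        help="max number of tokens to predict")
    return parser

# Parsing with no CLI flags yields the documented defaults.
args = build_parser().parse_args([])
print(args.repo_id_or_model_path, args.prompt, args.n_predict)
```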

Sample Output

Inference time: xxxx s
-------------------- Prompt --------------------
<s>[INST] <<SYS>>

<</SYS>>

What is AI? [/INST]
-------------------- Output --------------------
[INST] <<SYS>>

<</SYS>>

What is AI? [/INST]  Artificial intelligence (AI) is the broader field of research and development aimed at creating machines that can perform tasks that typically require human intelligence,