# Python Inference using IPEX-LLM on Intel GPU
We can run the PyTorch Inference Benchmark, Chat Service, and PyTorch Examples on Intel GPUs within Docker (on Linux or WSL).
> [!NOTE]
> The current Windows + WSL + Docker solution only supports Arc series dGPUs. For Windows users with an MTL iGPU, it is recommended to install IPEX-LLM directly via `pip install` in the Miniforge Prompt. Refer to [this guide](../Quickstart/install_windows_gpu.md).
## Install Docker
Follow the [Docker installation Guide](./docker_windows_gpu.md#install-docker) to install Docker on either Linux or Windows.
## Launch Docker
Prepare the ipex-llm-xpu Docker image:
```bash
docker pull intelanalytics/ipex-llm-xpu:latest
```
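Optionally, you can confirm that the image is available locally before launching a container:
```bash
# List the pulled ipex-llm-xpu image to confirm the download succeeded
docker images intelanalytics/ipex-llm-xpu
```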
Start the ipex-llm-xpu Docker container. Choose one of the following commands to start the container:
For Linux:
```bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models   # change to your model path
docker run -itd \
--net=host \
--device=/dev/dri \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
-v $MODEL_PATH:/llm/models \
$DOCKER_IMAGE
```
For Windows WSL:
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models   # change to your model path
sudo docker run -itd \
--net=host \
--privileged \
--device /dev/dri \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
-v $MODEL_PATH:/llm/models \
-v /usr/lib/wsl:/usr/lib/wsl \
$DOCKER_IMAGE
```
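After starting the container (on either Linux or WSL), you can check that it is up before attaching to it (prefix the command with `sudo` on WSL if needed):
```bash
# Confirm the container started successfully; STATUS should show "Up ..."
docker ps --filter "name=$CONTAINER_NAME"
```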
---
Access the container:
```bash
docker exec -it $CONTAINER_NAME bash
```
To verify that the device is successfully mapped into the container, run `sycl-ls` to check the result. On a machine with an Arc A770, a sample output is:
```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
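As an additional sanity check, assuming the container ships PyTorch with Intel Extension for PyTorch preinstalled (as the ipex-llm-xpu image is expected to), you can also query the XPU device from Python:
```bash
# Quick check that PyTorch can see the Intel GPU inside the container
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.xpu.is_available(), torch.xpu.device_count())"
```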
> [!TIP]
> You can run the Env-Check script to verify your ipex-llm installation and runtime environment.
>
> ```bash
> cd /ipex-llm/python/llm/scripts
> bash env-check.sh
> ```
## Run Inference Benchmark
Navigate to the `all-in-one` benchmark directory and modify `config.yaml` to configure the benchmark.
```bash
cd /benchmark/all-in-one
vim config.yaml
```
In the `config.yaml`, change `repo_id` to the model you want to test and `local_model_hub` to point to your model hub path.
```yaml
...
repo_id:
- 'meta-llama/Llama-2-7b-chat-hf'
local_model_hub: '/path/to/your/model/folder'
...
```
After modifying `config.yaml`, run the following commands to start benchmarking:
```bash
source ipex-llm-init --gpu --device <value>
python run.py
```
**Result Interpretation**
After benchmarking completes, you can find a CSV result file in the current folder. Focus mainly on the `1st token avg latency (ms)` and `2+ avg latency (ms/token)` columns for the benchmark results. Also check that the `actual input/output tokens` column is consistent with the `input/output tokens` column, and that the parameters you specified in `config.yaml` were successfully applied during benchmarking.
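For a quick look at the results from the shell, you can pretty-print the newest CSV as an aligned table (the exact file name depends on the run, so the wildcard below is an assumption):
```bash
# Show the most recent CSV result file in a readable, column-aligned form
latest_csv=$(ls -t *.csv | head -n 1)
column -s, -t "$latest_csv" | less -S
```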
## Run Chat Service
We provide `chat.py` for conversational AI.
For example, if your model is Llama-2-7b-chat-hf and is mounted at `/llm/models`, you can execute the following command to start a conversation:
```bash
cd /llm
python chat.py --model-path /llm/models/Llama-2-7b-chat-hf
```
## Run PyTorch Examples
We provide several PyTorch examples that apply IPEX-LLM INT4 optimizations to models on Intel GPUs.
For example, if your model is Llama-2-7b-chat-hf and is mounted at `/llm/models`, navigate to the `/examples/llama2` directory and execute the following command to run the example:
```bash
cd /examples/llama2
python ./generate.py --repo-id-or-model-path /llm/models/Llama-2-7b-chat-hf --prompt PROMPT --n-predict N_PREDICT
```
Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id of the Llama2 model to be downloaded (e.g. `meta-llama/Llama-2-7b-chat-hf` or `meta-llama/Llama-2-13b-chat-hf`), or the path to the Hugging Face checkpoint folder. It defaults to `'meta-llama/Llama-2-7b-chat-hf'`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
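For instance, with the model mounted as above, a concrete invocation would look like the following (the prompt and token count are just illustrative values):
```bash
# Run the Llama2 example against the locally mounted checkpoint
python ./generate.py --repo-id-or-model-path /llm/models/Llama-2-7b-chat-hf \
                     --prompt "What is AI?" \
                     --n-predict 32
```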
**Sample Output**
```log
Inference time: xxxx s
-------------------- Prompt --------------------
[INST] <<SYS>>
<</SYS>>
What is AI? [/INST]
-------------------- Output --------------------
[INST] <<SYS>>
<</SYS>>
What is AI? [/INST] Artificial intelligence (AI) is the broader field of research and development aimed at creating machines that can perform tasks that typically require human intelligence,
```