Merge CPU & XPU Dockerfiles with Serving Images and Refactor (#12815)
* Update Dockerfile
* Update Dockerfile
* Ensure scripts are executable
* Update Dockerfile
* Update Dockerfile
* Update Dockerfile
* Update Dockerfile
* Update Dockerfile
* Update Dockerfile
* update
* Update Dockerfile
* remove inference-cpu and inference-xpu
* update README
This commit is contained in:
parent eaec64baca
commit f7b5a093a7

17 changed files with 467 additions and 621 deletions

@@ -13,20 +13,19 @@ You can run IPEX-LLM containers (via docker or k8s) for inference, serving and f
#### Pull an IPEX-LLM Docker Image

To pull IPEX-LLM Docker images from [Docker Hub](https://hub.docker.com/u/intelanalytics), use the `docker pull` command. For instance, to pull the CPU inference image:

```bash
docker pull intelanalytics/ipex-llm-cpu:2.2.0-SNAPSHOT
docker pull intelanalytics/ipex-llm-serving-cpu:2.2.0-SNAPSHOT
```

Available images in the hub are:

| Image Name | Description |
| --- | --- |
| intelanalytics/ipex-llm-cpu:2.2.0-SNAPSHOT | CPU Inference |
| intelanalytics/ipex-llm-xpu:2.2.0-SNAPSHOT | GPU Inference |
| intelanalytics/ipex-llm-serving-cpu:2.2.0-SNAPSHOT | CPU Serving |
| intelanalytics/ipex-llm-serving-xpu:2.2.0-SNAPSHOT | GPU Serving |
| intelanalytics/ipex-llm-serving-cpu:2.2.0-SNAPSHOT | CPU Inference & Serving |
| intelanalytics/ipex-llm-serving-xpu:2.2.0-SNAPSHOT | GPU Inference & Serving |
| intelanalytics/ipex-llm-inference-cpp-xpu:2.2.0-SNAPSHOT | Run llama.cpp/Ollama/Open-WebUI on GPU via Docker |
| intelanalytics/ipex-llm-finetune-qlora-xpu:2.2.0-SNAPSHOT | GPU Finetuning |
| intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.2.0-SNAPSHOT | CPU Finetuning via Docker |
| intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:2.2.0-SNAPSHOT | CPU Finetuning via Kubernetes |

#### Run a Container

Use the `docker run` command to run an IPEX-LLM docker container. For detailed instructions, refer to the [IPEX-LLM Docker Container Guides](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/index.html).
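For a quick start, a minimal launch of the combined CPU inference & serving image might look like the sketch below; the container name is a placeholder and the flags mirror the examples elsewhere in this commit, so adjust them to your machine (add `--device=/dev/dri` for the XPU images).

```bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:2.2.0-SNAPSHOT

# Start the container in the background, then open a shell inside it
sudo docker run -itd \
        --net=host \
        --name=ipex-llm-container \
        --shm-size="16g" \
        $DOCKER_IMAGE

sudo docker exec -it ipex-llm-container bash
```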
@@ -1,57 +1,60 @@
FROM intel/oneapi-basekit:2024.0.1-devel-ubuntu22.04

ARG http_proxy
ARG https_proxy
ENV TZ=Asia/Shanghai
ARG PIP_NO_CACHE_DIR=false

# When cache is enabled SYCL runtime will try to cache and reuse JIT-compiled binaries.
ENV SYCL_CACHE_PERSISTENT=1
ENV TZ=Asia/Shanghai SYCL_CACHE_PERSISTENT=1

# retrieve oneAPI repo public key
RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/intel-oneapi-archive-keyring.gpg > /dev/null && \
RUN set -eux && \
    # Set timezone
    ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone && \
    #
    # Retrieve Intel OneAPI and GPU repository keys
    wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/intel-oneapi-archive-keyring.gpg > /dev/null && \
    echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main " | tee /etc/apt/sources.list.d/oneAPI.list && \
    chmod 644 /usr/share/keyrings/intel-oneapi-archive-keyring.gpg && \
    rm /etc/apt/sources.list.d/intel-graphics.list && \
    wget -O- https://repositories.intel.com/graphics/intel-graphics.key | gpg --dearmor | tee /usr/share/keyrings/intel-graphics.gpg > /dev/null && \
    echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc" | tee /etc/apt/sources.list.d/intel.gpu.jammy.list && \
    chmod 644 /usr/share/keyrings/intel-graphics.gpg && \
    # update dependencies
    #
    # Update package lists and install dependencies
    apt-get update && \
    # install basic dependencies
    apt-get install -y --no-install-recommends curl wget git libunwind8-dev vim less && \
    # install Intel GPU driver
    apt-get install -y --no-install-recommends intel-opencl-icd intel-level-zero-gpu level-zero level-zero-dev --allow-downgrades && \
    # install python 3.11
    ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone && \
    env DEBIAN_FRONTEND=noninteractive apt-get update && \
    # add-apt-repository requires gnupg, gpg-agent, software-properties-common
    apt-get install -y --no-install-recommends gnupg gpg-agent software-properties-common && \
    # Add Python 3.11 PPA repository
    apt-get install -y --no-install-recommends \
        curl wget git vim less libunwind8-dev \
        intel-opencl-icd intel-level-zero-gpu level-zero level-zero-dev \
        gnupg gpg-agent software-properties-common && \
    #
    # Add Python 3.11 PPA and install Python 3.11
    add-apt-repository ppa:deadsnakes/ppa -y && \
    apt-get install -y --no-install-recommends python3.11 python3-pip python3.11-dev python3-wheel python3.11-distutils && \
    # avoid axolotl lib conflict
    apt-get remove -y python3-blinker && apt autoremove -y && \
    # link to python 3.11
    rm /usr/bin/python3 && \
    ln -s /usr/bin/python3.11 /usr/bin/python3 && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    # remove apt cache
    apt-get install -y --no-install-recommends \
        python3.11 python3-pip python3.11-dev python3-wheel python3.11-distutils && \
    #
    # Remove unnecessary packages and clean up
    apt-get remove -y python3-blinker && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/* && \
    # upgrade pip
    wget https://bootstrap.pypa.io/get-pip.py -O get-pip.py && \
    python3 get-pip.py && \
    # install XPU ipex-llm
    #
    # Set Python 3.11 as default
    ln -sf /usr/bin/python3.11 /usr/bin/python3 && \
    ln -sf /usr/bin/python3 /usr/bin/python && \
    #
    # Upgrade pip
    wget -qO /tmp/get-pip.py https://bootstrap.pypa.io/get-pip.py && \
    python3 /tmp/get-pip.py && \
    rm /tmp/get-pip.py && \
    #
    # Install Intel XPU ipex-llm and dependencies
    pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ && \
    # prepare finetune code and scripts
    pip install transformers==4.36.0 peft==0.10.0 datasets bitsandbytes scipy fire && \
    #
    # Clone finetuning scripts and setup configuration
    git clone https://github.com/intel-analytics/IPEX-LLM.git && \
    mv IPEX-LLM/python/llm/example/GPU/LLM-Finetuning /LLM-Finetuning && \
    rm -rf IPEX-LLM && \
    # install transformers & peft dependencies
    pip install transformers==4.36.0 && \
    pip install peft==0.10.0 datasets && \
    pip install bitsandbytes scipy fire && \
    # Prepare accelerate config
    mkdir -p /root/.cache/huggingface/accelerate && \
    mv /LLM-Finetuning/axolotl/default_config.yaml /root/.cache/huggingface/accelerate/

# Copy startup script
COPY ./start-qlora-finetuning-on-xpu.sh /start-qlora-finetuning-on-xpu.sh
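As a rough usage sketch for the image built from this Dockerfile (the tag follows the README table above, the host model path is a placeholder, and `--device=/dev/dri` exposes the Intel GPU):

```bash
docker build \
  --build-arg http_proxy=$HTTP_PROXY \
  --build-arg https_proxy=$HTTPS_PROXY \
  -t intelanalytics/ipex-llm-finetune-qlora-xpu:2.2.0-SNAPSHOT .

sudo docker run -itd \
        --net=host \
        --device=/dev/dri \
        --name=ipex-llm-finetune-xpu \
        --shm-size="16g" \
        -v /path/to/base-model:/model \
        intelanalytics/ipex-llm-finetune-qlora-xpu:2.2.0-SNAPSHOT

# Inside the container, the copied startup script kicks off QLoRA finetuning:
#   bash /start-qlora-finetuning-on-xpu.sh
```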
@@ -1,20 +1,32 @@
# Stage 1: Build stage to handle file preparation
FROM ubuntu:22.04 as build

# Copy the files to the build image
COPY ./start-llama-cpp.sh ./start-ollama.sh ./benchmark_llama-cpp.sh /llm/scripts/

# Stage 2: Final image that only includes necessary runtime artifacts
FROM intel/oneapi-basekit:2025.0.2-0-devel-ubuntu22.04

# Copy the scripts from the build stage
COPY --from=build /llm/scripts /llm/scripts/

# Set build arguments for proxy
ARG http_proxy
ARG https_proxy
# Disable pip cache
ARG PIP_NO_CACHE_DIR=false

# Set environment variables
ENV TZ=Asia/Shanghai \
    PYTHONUNBUFFERED=1 \
    SYCL_CACHE_PERSISTENT=1

# Disable pip cache
ARG PIP_NO_CACHE_DIR=false

# Install dependencies and configure the environment
RUN set -eux && \
    \
    #
    # Ensure scripts are executable
    chmod +x /llm/scripts/*.sh && \
    #
    # Configure Intel OneAPI and GPU repositories
    wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/intel-oneapi-archive-keyring.gpg > /dev/null && \
    echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | tee /etc/apt/sources.list.d/oneAPI.list && \

@@ -23,32 +35,32 @@ RUN set -eux && \
    wget -O- https://repositories.intel.com/graphics/intel-graphics.key | gpg --dearmor | tee /usr/share/keyrings/intel-graphics.gpg > /dev/null && \
    echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc" | tee /etc/apt/sources.list.d/intel.gpu.jammy.list && \
    chmod 644 /usr/share/keyrings/intel-graphics.gpg && \
    \
    #
    # Update and install basic dependencies
    apt-get update && \
    apt-get install -y --no-install-recommends \
      curl wget git sudo libunwind8-dev vim less gnupg gpg-agent software-properties-common && \
    \
    #
    # Set timezone
    ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && \
    echo $TZ > /etc/timezone && \
    \
    #
    # Install Python 3.11
    add-apt-repository ppa:deadsnakes/ppa -y && \
    apt-get install -y --no-install-recommends python3.11 python3-pip python3.11-dev python3.11-distutils python3-wheel && \
    rm /usr/bin/python3 && ln -s /usr/bin/python3.11 /usr/bin/python3 && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    \
    #
    # Install pip and essential Python packages
    wget https://bootstrap.pypa.io/get-pip.py -O get-pip.py && \
    python3 get-pip.py && rm get-pip.py && \
    pip install --upgrade requests argparse urllib3 && \
    pip install --pre --upgrade ipex-llm[cpp] && \
    pip install transformers==4.36.2 transformers_stream_generator einops tiktoken && \
    \
    #
    # Remove breaks install packages
    apt-get remove -y libze-dev libze-intel-gpu1 && \
    \
    #
    # Install Intel GPU OpenCL Driver and Compute Runtime
    mkdir -p /tmp/gpu && cd /tmp/gpu && \
    echo "Downloading Intel Compute Runtime (24.52) for Gen12+..." && \

@@ -57,29 +69,21 @@ RUN set -eux && \
    wget https://github.com/intel/compute-runtime/releases/download/24.52.32224.5/intel-level-zero-gpu_1.6.32224.5_amd64.deb && \
    wget https://github.com/intel/compute-runtime/releases/download/24.52.32224.5/intel-opencl-icd_24.52.32224.5_amd64.deb && \
    wget https://github.com/intel/compute-runtime/releases/download/24.52.32224.5/libigdgmm12_22.5.5_amd64.deb && \
    \
    #
    echo "Downloading Legacy Compute Runtime (24.35) for pre-Gen12 support..." && \
    wget https://github.com/intel/compute-runtime/releases/download/24.35.30872.22/intel-level-zero-gpu-legacy1_1.3.30872.22_amd64.deb && \
    wget https://github.com/intel/compute-runtime/releases/download/24.35.30872.22/intel-opencl-icd-legacy1_24.35.30872.22_amd64.deb && \
    wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.17537.20/intel-igc-core_1.0.17537.20_amd64.deb && \
    wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.17537.20/intel-igc-opencl_1.0.17537.20_amd64.deb && \
    \
    #
    dpkg -i *.deb && rm -rf /tmp/gpu && \
    \
    #
    # Install oneAPI Level Zero Loader
    mkdir /tmp/level-zero && cd /tmp/level-zero && \
    wget https://github.com/oneapi-src/level-zero/releases/download/v1.20.2/level-zero_1.20.2+u22.04_amd64.deb && \
    wget https://github.com/oneapi-src/level-zero/releases/download/v1.20.2/level-zero-devel_1.20.2+u22.04_amd64.deb && \
    dpkg -i *.deb && rm -rf /tmp/level-zero && \
    \
    #
    # Clean up unnecessary dependencies to reduce image size
    find /usr/lib/python3/dist-packages/ -name 'blinker*' -exec rm -rf {} + && \
    rm -rf /root/.cache/Cypress && \
    \
    # Create necessary directories
    mkdir -p /llm/scripts

# Copy startup scripts
COPY ./start-llama-cpp.sh /llm/scripts/
COPY ./start-ollama.sh /llm/scripts/
COPY ./benchmark_llama-cpp.sh /llm/scripts/
    rm -rf /root/.cache/Cypress
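A sketch of how the resulting llama.cpp/Ollama image is typically launched (the tag matches the README table above; the container name is a placeholder, and `--device=/dev/dri` maps the GPU into the container):

```bash
sudo docker run -itd \
        --net=host \
        --device=/dev/dri \
        --name=ipex-llm-inference-cpp \
        --shm-size="16g" \
        intelanalytics/ipex-llm-inference-cpp-xpu:2.2.0-SNAPSHOT

sudo docker exec -it ipex-llm-inference-cpp bash
# Inside the container, the bundled scripts under /llm/scripts can then be used,
# e.g. start-ollama.sh to serve models through Ollama, or start-llama-cpp.sh
# and benchmark_llama-cpp.sh for llama.cpp.
```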
@@ -1,71 +0,0 @@
FROM ubuntu:22.04

ARG http_proxy
ARG https_proxy
ARG PIP_NO_CACHE_DIR=false
ARG DEBIAN_FRONTEND=noninteractive

ENV PYTHONUNBUFFERED=1

COPY ./start-notebook.sh /llm/start-notebook.sh

# Update the software sources
RUN env DEBIAN_FRONTEND=noninteractive apt-get update && \
# Install essential packages
    apt-get install -y --no-install-recommends libunwind8-dev vim less && \
# Install git, curl, and wget
    apt-get install -y --no-install-recommends git curl wget && \
# Install Python 3.11
    # add-apt-repository requires gnupg, gpg-agent, software-properties-common
    apt-get install -y --no-install-recommends gnupg gpg-agent software-properties-common && \
    # Add Python 3.11 PPA repository
    add-apt-repository ppa:deadsnakes/ppa -y && \
    # Install Python 3.11
    apt-get install -y --no-install-recommends python3.11 && \
    # Install Python 3.11 development and utility packages
    apt-get install -y --no-install-recommends python3-pip python3.11-dev python3-wheel python3.11-distutils && \
    # Remove the original /usr/bin/python3 symbolic link
    rm /usr/bin/python3 && \
    # Create a symbolic link pointing to Python 3.11 at /usr/bin/python3
    ln -s /usr/bin/python3.11 /usr/bin/python3 && \
    # Create a symbolic link pointing to /usr/bin/python3 at /usr/bin/python
    ln -s /usr/bin/python3 /usr/bin/python && \
# Download and install pip, install FastChat from source requires PEP 660 support
    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
    python3 get-pip.py && \
    rm get-pip.py && \
    pip install --upgrade requests argparse urllib3 && \
# Download ipex-llm-tutorial
    pip install --upgrade jupyterlab && \
    git clone https://github.com/intel-analytics/ipex-llm-tutorial /llm/ipex-llm-tutorial && \
    chmod +x /llm/start-notebook.sh && \
# Download all-in-one benchmark
    git clone https://github.com/intel-analytics/IPEX-LLM && \
    cp -r ./IPEX-LLM/python/llm/dev/benchmark/ /llm/benchmark && \
# Copy chat.py script
    pip install --upgrade colorama && \
    cp -r ./IPEX-LLM/python/llm/portable-zip/ /llm/portable-zip && \
# Install all-in-one dependencies
    apt-get install -y --no-install-recommends numactl && \
    pip install --upgrade omegaconf && \
    pip install --upgrade pandas && \
# Install vllm dependencies
    pip install --upgrade fastapi && \
    pip install --upgrade "uvicorn[standard]" && \
# Add Qwen support
    pip install --upgrade transformers_stream_generator einops && \
# Copy vLLM-Serving
    cp -r ./IPEX-LLM/python/llm/example/CPU/vLLM-Serving/ /llm/vLLM-Serving && \
    rm -rf ./IPEX-LLM && \
# Fix vllm service
    pip install pydantic==1.10.11 && \
# Install ipex-llm
    pip install --pre --upgrade ipex-llm[all] && \
    # Fix CVE-2024-22195
    pip install Jinja2==3.1.3 && \
    pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cpu && \
    pip install intel-extension-for-pytorch==2.2.0 && \
    pip install oneccl_bind_pt==2.2.0 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/cn/ && \
    pip install transformers==4.36.2

ENTRYPOINT ["/bin/bash"]
@@ -1,68 +0,0 @@
## Build/Use IPEX-LLM cpu image

### Build Image
```bash
docker build \
  --build-arg http_proxy=.. \
  --build-arg https_proxy=.. \
  --build-arg no_proxy=.. \
  --rm --no-cache -t intelanalytics/ipex-llm-cpu:2.2.0-SNAPSHOT .
```

### Use the image for doing cpu inference

An example could be:
```bash
#/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-cpu:2.2.0-SNAPSHOT

sudo docker run -itd \
        --net=host \
        --cpuset-cpus="0-47" \
        --cpuset-mems="0" \
        --memory="32G" \
        --name=CONTAINER_NAME \
        --shm-size="16g" \
        $DOCKER_IMAGE
```

After the container is booted, you could get into the container through `docker exec`.

To run inference using `IPEX-LLM` using cpu, you could refer to this [documentation](https://github.com/intel-analytics/IPEX-LLM/tree/main/python/llm#cpu-int4).

### Use chat.py

chat.py can be used to initiate a conversation with a specified model. The file is under directory '/llm'.

You can download models and bind the model directory from host machine to container when start a container.

Here is an example:
```bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-cpu:2.2.0-SNAPSHOT
export MODEL_PATH=/home/llm/models

sudo docker run -itd \
        --net=host \
        --cpuset-cpus="0-47" \
        --cpuset-mems="0" \
        --memory="32G" \
        --name=CONTAINER_NAME \
        --shm-size="16g" \
        -v $MODEL_PATH:/llm/models/
        $DOCKER_IMAGE

```

After entering the container through `docker exec`, you can run chat.py by:
```bash
cd /llm
python chat.py --model-path YOUR_MODEL_PATH
```
In the example above, it can be:
```bash
cd /llm
python chat.py --model-path /llm/models/MODEL_NAME
```
@@ -1,97 +0,0 @@
FROM intel/oneapi:2024.2.1-0-devel-ubuntu22.04

ARG http_proxy
ARG https_proxy

ENV TZ=Asia/Shanghai
ENV PYTHONUNBUFFERED=1

# When cache is enabled SYCL runtime will try to cache and reuse JIT-compiled binaries.
ENV SYCL_CACHE_PERSISTENT=1

COPY chat.py /llm/chat.py
COPY benchmark.sh /llm/benchmark.sh

# Disable pip's cache behavior
ARG PIP_NO_CACHE_DIR=false

RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/intel-oneapi-archive-keyring.gpg > /dev/null && \
    echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main " | tee /etc/apt/sources.list.d/oneAPI.list && \
    chmod 644 /usr/share/keyrings/intel-oneapi-archive-keyring.gpg && \
    rm /etc/apt/sources.list.d/intel-graphics.list && \
    wget -O- https://repositories.intel.com/graphics/intel-graphics.key | gpg --dearmor | tee /usr/share/keyrings/intel-graphics.gpg > /dev/null && \
    echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc" | tee /etc/apt/sources.list.d/intel.gpu.jammy.list && \
    chmod 644 /usr/share/keyrings/intel-graphics.gpg && \
    apt-get update && \
    apt-get install -y --no-install-recommends curl wget git libunwind8-dev vim less && \
    # Install PYTHON 3.11 and IPEX-LLM[xpu]
    ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone && \
    env DEBIAN_FRONTEND=noninteractive apt-get update && \
    # add-apt-repository requires gnupg, gpg-agent, software-properties-common
    apt-get install -y --no-install-recommends gnupg gpg-agent software-properties-common && \
    export PRE_DIR=$(pwd) && \
    # Install Compute Runtime
    mkdir -p /tmp/neo && \
    cd /tmp/neo && \
    wget https://github.com/oneapi-src/level-zero/releases/download/v1.18.5/level-zero_1.18.5+u22.04_amd64.deb && \
    wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.17791.9/intel-igc-core_1.0.17791.9_amd64.deb && \
    wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.17791.9/intel-igc-opencl_1.0.17791.9_amd64.deb && \
    wget https://github.com/intel/compute-runtime/releases/download/24.39.31294.12/intel-level-zero-gpu_1.6.31294.12_amd64.deb && \
    wget https://github.com/intel/compute-runtime/releases/download/24.39.31294.12/intel-opencl-icd_24.39.31294.12_amd64.deb && \
    wget https://github.com/intel/compute-runtime/releases/download/24.39.31294.12/libigdgmm12_22.5.2_amd64.deb && \
    dpkg -i *.deb && \
    rm -rf /tmp/neo && \
    cd $PRE_DIR && \
    # Add Python 3.11 PPA repository
    add-apt-repository ppa:deadsnakes/ppa -y && \
    apt-get install -y --no-install-recommends python3.11 git curl wget && \
    rm /usr/bin/python3 && \
    ln -s /usr/bin/python3.11 /usr/bin/python3 && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    apt-get install -y --no-install-recommends python3-pip python3.11-dev python3-wheel python3.11-distutils && \
    wget https://bootstrap.pypa.io/get-pip.py -O get-pip.py && \
    # Install FastChat from source requires PEP 660 support
    python3 get-pip.py && \
    rm get-pip.py && \
    pip install --upgrade requests argparse urllib3 && \
    pip install --pre --upgrade ipex-llm[xpu_arc] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ && \
    pip install --pre pytorch-triton-xpu==3.0.0+1b2f15840e --index-url https://download.pytorch.org/whl/nightly/xpu && \
    # Fix Trivy CVE Issues
    pip install transformers_stream_generator einops tiktoken && \
    # Install opencl-related repos
    apt-get update && \
    # Install related libary of chat.py
    pip install --upgrade colorama && \
    # Download all-in-one benchmark and examples
    git clone https://github.com/intel-analytics/ipex-llm && \
    cp -r ./ipex-llm/python/llm/dev/benchmark/ ./benchmark && \
    cp -r ./ipex-llm/python/llm/example/GPU/HuggingFace/LLM ./examples && \
    # Install vllm dependencies
    pip install --upgrade fastapi && \
    pip install --upgrade "uvicorn[standard]" && \
    # Download vLLM-Serving
    cp -r ./ipex-llm/python/llm/example/GPU/vLLM-Serving/ ./vLLM-Serving && \
    # Download pp_serving
    mkdir -p /llm/pp_serving && \
    cp ./ipex-llm/python/llm/example/GPU/Pipeline-Parallel-Serving/*.py /llm/pp_serving/ && \
    # Download lightweight_serving
    mkdir -p /llm/lightweight_serving && \
    cp ./ipex-llm/python/llm/example/GPU/Lightweight-Serving/*.py /llm/lightweight_serving/ && \
    # Install related library of benchmarking
    pip install pandas omegaconf && \
    chmod +x /llm/benchmark.sh && \
    # Download Deepspeed-AutoTP
    cp -r ./ipex-llm/python/llm/example/GPU/Deepspeed-AutoTP/ ./Deepspeed-AutoTP && \
    # Install related library of Deepspeed-AutoTP
    pip install oneccl_bind_pt==2.3.100 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ && \
    pip install git+https://github.com/microsoft/DeepSpeed.git@ed8aed5 && \
    pip install git+https://github.com/intel/intel-extension-for-deepspeed.git@0eb734b && \
    pip install mpi4py && \
    apt-get update && \
    apt-get install -y --no-install-recommends google-perftools && \
    ln -s /usr/local/lib/python3.11/dist-packages/ipex_llm/libs/libtcmalloc.so /lib/libtcmalloc.so && \
    rm -rf ./ipex-llm


WORKDIR /llm/
ENV BIGDL_CHECK_DUPLICATE_IMPORT=0
@@ -1,45 +0,0 @@
## Build/Use IPEX-LLM xpu image

### Build Image
```bash
docker build \
  --build-arg http_proxy=.. \
  --build-arg https_proxy=.. \
  --build-arg no_proxy=.. \
  --rm --no-cache -t intelanalytics/ipex-llm-xpu:2.2.0-SNAPSHOT .
```

### Use the image for doing xpu inference

To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container.

An example could be:
```bash
#/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:2.2.0-SNAPSHOT

sudo docker run -itd \
        --net=host \
        --device=/dev/dri \
        --memory="32G" \
        --name=CONTAINER_NAME \
        --shm-size="16g" \
        $DOCKER_IMAGE
```

After the container is booted, you could get into the container through `docker exec`.

To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. In a machine with Arc A770, the sampled output is:

```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```

To run inference using `IPEX-LLM` using xpu, you could refer to this [documentation](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU).
@@ -1,29 +1,93 @@
FROM intelanalytics/ipex-llm-cpu:2.2.0-SNAPSHOT
# Stage 1: Build stage to handle file preparation
FROM ubuntu:22.04 as build

# Copy the files to the build image
COPY ./start-notebook.sh               /llm/
COPY ./model_adapter.py.patch          /llm/
COPY ./vllm_offline_inference.py       /llm/
COPY ./payload-1024.lua                /llm/
COPY ./start-vllm-service.sh           /llm/
COPY ./benchmark_vllm_throughput.py    /llm/
COPY ./start-fastchat-service.sh       /llm/

# Stage 2: Final image that only includes necessary runtime artifacts
FROM ubuntu:22.04

# Copy the scripts from the build stage
COPY --from=build /llm /llm/

ARG http_proxy
ARG https_proxy
ARG TINI_VERSION=v0.18.0

# Disable pip's cache behavior
ARG PIP_NO_CACHE_DIR=false
ARG DEBIAN_FRONTEND=noninteractive

COPY ./model_adapter.py.patch /llm/model_adapter.py.patch
ENV PYTHONUNBUFFERED=1

# Install Serving Dependencies
RUN wget -qO /sbin/tini https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini && \
    chmod +x /sbin/tini && \
    cd /llm && \
    apt-get update && \
    apt-get install -y --no-install-recommends wrk patch g++ && \
RUN apt-get update && apt-get install -y --no-install-recommends \
    # Install basic utilities
    libunwind8-dev vim less \
    # Version control and download tools
    git curl wget \
    # add-apt-repository requires gnupg, gpg-agent, software-properties-common
    gnupg gpg-agent software-properties-common \
    # Install performance testing tool, NUMA (Non-Uniform Memory Access) support, and patch tool
    wrk numactl patch && \
# Install Python 3.11
    # Add Python 3.11 PPA repository
    add-apt-repository ppa:deadsnakes/ppa -y && \
    # Install Python 3.11 and related packages
    apt-get update && apt-get install -y --no-install-recommends python3.11 python3-pip python3.11-dev python3-wheel python3.11-distutils && \
    # Remove the original /usr/bin/python3 symbolic link
    rm /usr/bin/python3 && \
    # Create a symbolic link pointing to Python 3.11 at /usr/bin/python3
    ln -s /usr/bin/python3.11 /usr/bin/python3 && \
    # Create a symbolic link pointing to /usr/bin/python3 at /usr/bin/python
    ln -s /usr/bin/python3 /usr/bin/python && \
# Download and install pip, install FastChat from source requires PEP 660 support
    curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && \
    python3 get-pip.py && \
    rm get-pip.py && \
# Install Basic Python utilities
    pip install --upgrade requests argparse urllib3 && \
# Download ipex-llm-tutorial
    pip install --upgrade jupyterlab && \
    git clone https://github.com/intel-analytics/ipex-llm-tutorial /llm/ipex-llm-tutorial && \
    chmod +x /llm/start-notebook.sh && \
# Download all-in-one benchmark
    git clone https://github.com/intel-analytics/IPEX-LLM && \
    cp -r ./IPEX-LLM/python/llm/dev/benchmark/ /llm/benchmark && \
# Copy chat.py script
    pip install --upgrade colorama && \
    cp -r ./IPEX-LLM/python/llm/portable-zip/ /llm/portable-zip && \
# Install all-in-one dependencies
    pip install --upgrade omegaconf && \
    pip install --upgrade pandas && \
# Install ipex-llm
    pip install --pre --upgrade ipex-llm[serving] && \
    apt-get install -y gcc-12 g++-12 libnuma-dev && \
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 && \
    # Fix Trivy CVE Issues
    pip install Jinja2==3.1.3 transformers==4.36.2 gradio==4.19.2 cryptography==42.0.4 && \
    # Fix CVE-2024-22195
    pip install Jinja2==3.1.3 && \
    pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cpu && \
    pip install intel-extension-for-pytorch==2.2.0 && \
    pip install oneccl_bind_pt==2.2.0 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/cn/ && \
    pip install transformers==4.36.2 && \
# Install vllm dependencies
    pip install --upgrade fastapi && \
    pip install --upgrade "uvicorn[standard]" && \
# Add Qwen support
    pip install --upgrade transformers_stream_generator einops && \
# Fix Qwen model adapter in fastchat
    patch /usr/local/lib/python3.11/dist-packages/fastchat/model/model_adapter.py < /llm/model_adapter.py.patch && \
    cp /sbin/tini /usr/bin/tini && \
# Copy vLLM-Serving
    cp -r ./IPEX-LLM/python/llm/example/CPU/vLLM-Serving/ /llm/vLLM-Serving && \
    rm -rf ./IPEX-LLM && \
# Fix vllm service
    pip install pydantic==1.10.11 && \
# Install vllm
    apt-get install -y g++ gcc-12 g++-12 libnuma-dev && \
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 && \
    git clone https://github.com/vllm-project/vllm.git && \
    cd ./vllm && \
    git checkout v0.6.6.post1 && \

@@ -31,13 +95,8 @@ RUN wget -qO /sbin/tini https://github.com/krallin/tini/releases/download/${TINI
    pip uninstall -y intel-extension-for-pytorch && \
    pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu && \
    VLLM_TARGET_DEVICE=cpu python3 setup.py install && \
    pip install ray

COPY ./vllm_offline_inference.py       /llm/
COPY ./payload-1024.lua                /llm/
COPY ./start-vllm-service.sh           /llm/
COPY ./benchmark_vllm_throughput.py    /llm/
COPY ./start-fastchat-service.sh       /llm/
    pip install ray && \
# Clean up unnecessary files to reduce image size
    rm -rf /var/lib/apt/lists/*

WORKDIR /llm/
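After building, a quick sanity check along these lines can confirm that the CPU vLLM build and `ipex-llm` are importable inside the image (the tag matches the README that follows; overriding the entrypoint is only a convenience and an assumption about how the image is usually inspected):

```bash
sudo docker run --rm --entrypoint /bin/bash \
    intelanalytics/ipex-llm-serving-cpu:2.2.0-SNAPSHOT \
    -c "python3 -c 'import vllm, ipex_llm; print(vllm.__version__)'"
```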
@@ -1,6 +1,14 @@
## Build/Use IPEX-LLM-serving cpu image
# IPEX-LLM-Serving CPU Image: Build and Usage Guide

This document provides instructions for building and using the `IPEX-LLM-serving` CPU Docker image, including model inference, serving, and benchmarking functionalities.

---

## 1. Build the Image  

To build the `ipex-llm-serving-cpu` Docker image, run the following command:  

### Build Image
```bash
docker build \
  --build-arg http_proxy=.. \
@@ -9,74 +17,127 @@ docker build \
  --rm --no-cache -t intelanalytics/ipex-llm-serving-cpu:2.2.0-SNAPSHOT .
```

### Use the image for doing cpu serving
---

## 2. Run the Container  

You could use the following bash script to start the container.  Please be noted that the CPU config is specified for Xeon CPUs, change it accordingly if you are not using a Xeon CPU.
Before running `chat.py` or using serving functionalities, start the container using the following command.  

### **Step 1: Download the Model (Optional)**  

If using a local model, download it to your host machine and bind the directory to the container when launching it.  

```bash
export MODEL_PATH=/home/llm/models  # Change this to your model directory
```

This ensures the container has access to the necessary models.  
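If the model is hosted on the Hugging Face Hub, one way to populate this directory is with the `huggingface-cli` tool; the repository id below is only an illustrative placeholder.

```bash
pip install -U huggingface_hub

# Download an example model into the directory that will be mounted into the container
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
    --local-dir $MODEL_PATH/Llama-2-7b-chat-hf
```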
---

### **Step 2: Start the Container**  

Use the following command to start the container:  

```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:2.2.0-SNAPSHOT

# --net=host     : use host networking for performance
# --cpuset-cpus  : limit the container to specific CPU cores
# --cpuset-mems  : bind the container to NUMA node 0 for memory locality
# --memory       : limit memory usage to 32 GB
# --shm-size     : set shared memory size to 16 GB (useful for large models)
# -v             : mount the model directory
sudo docker run -itd \
        --net=host \
        --cpuset-cpus="0-47" \
        --cpuset-mems="0" \
        --memory="32G" \
        --shm-size="16g" \
        --name=CONTAINER_NAME \
        -v $MODEL_PATH:/llm/models/ \
        $DOCKER_IMAGE
```

After the container is booted, you could get into the container through `docker exec`.
### **Step 3: Access the Running Container**  

#### FastChat serving engine
To run FastChat-serving using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/src/ipex_llm/serving/fastchat).
Once the container is started, you can access it using:  

#### vLLM serving engine
```bash
sudo docker exec -it CONTAINER_NAME bash
```

To run vLLM engine using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/vLLM-Serving/README.md).
---

We have included multiple example files in `/llm/`:
1. `vllm_offline_inference.py`: Used for vLLM offline inference example
2. `benchmark_vllm_throughput.py`: Used for benchmarking throughput
3. `payload-1024.lua`: Used for testing request per second using 1k-128 request
4. `start-vllm-service.sh`: Used for template for starting vLLM service
## 3. Using `chat.py` for Inference  

##### Online benchmark through api_server
The `chat.py` script is used for model inference. It is located under the `/llm` directory inside the container.  

We can benchmark the api_server to get an estimation about TPS (transactions per second).  To do so, you need to start the service first according to the instructions in this [section](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/vLLM-Serving/README.md#service).
### Steps:  

In container, do the following:
1. modify the `/llm/payload-1024.lua` so that the "model" attribute is correct.  By default, we use a prompt that is roughly 1024 tokens long, you can change it if needed.
2. Start the benchmark using `wrk` using the script below:
1. **Run `chat.py` for inference** inside the container:  

   ```bash
   cd /llm
# You can change -t and -c to control the concurrency.
# By default, we use 12 connections to benchmark the service.
wrk -t4 -c4 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h

   python chat.py --model-path /llm/models/MODEL_NAME
   ```
#### Offline benchmark through benchmark_vllm_throughput.py

We have included the benchmark_throughput script provided by `vllm` in our image as `/llm/benchmark_vllm_throughput.py`.  To use the benchmark_throughput script, you will need to download the test dataset through:
   Replace `MODEL_NAME` with the name of your model.  

---

## 4. Serving with IPEX-LLM  

The container supports multiple serving engines.  

### 4.1 Serving with FastChat Engine  

To run FastChat-serving using `IPEX-LLM` as the backend, refer to this [document](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/src/ipex_llm/serving/fastchat).  
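As a rough sketch of what a FastChat deployment inside the container usually involves (the module names below come from upstream FastChat and are assumptions here; the linked document describes the IPEX-LLM-specific worker to use instead of the stock one):

```bash
# 1. Start the FastChat controller
python3 -m fastchat.serve.controller &

# 2. Start a model worker (replace with the IPEX-LLM worker described in the linked document)
python3 -m fastchat.serve.model_worker \
    --model-path /llm/models/MODEL_NAME --device cpu &

# 3. Expose an OpenAI-compatible REST API on port 8000
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
```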
---

### 4.2 Serving with vLLM Engine  

To use **vLLM** with `IPEX-LLM` as the backend, refer to the [vLLM Serving Guide](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/vLLM-Serving/README.md).  

The following example files are included in the `/llm/` directory inside the container (an example request against the running service is shown after this list):  

- `vllm_offline_inference.py`: Used for vLLM offline inference example.  
- `benchmark_vllm_throughput.py`: Used for throughput benchmarking.  
- `payload-1024.lua`: Used for testing requests per second with a 1k-128 request pattern.  
- `start-vllm-service.sh`: Template script for starting the vLLM service.  
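Once `start-vllm-service.sh` has brought up the OpenAI-compatible endpoint, it can be exercised with a plain HTTP request such as the one below; the port matches the benchmark examples in this guide, and the model name must match the one the service was started with.

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "YOUR_MODEL",
          "prompt": "What is IPEX-LLM?",
          "max_tokens": 128
        }'
```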
---

## 5. Benchmarks  

### 5.1 Online Benchmark through API Server  

To benchmark the API Server and estimate transactions per second (TPS), first start the service as per the instructions in the [vLLM Serving Guide](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/vLLM-Serving/README.md#service).  

Then, follow these steps:  

1. **Modify the `payload-1024.lua` file** to ensure the `"model"` attribute is correctly set.  
2. **Run the benchmark using `wrk`**:  

   ```bash
   cd /llm
   # You can adjust -t and -c to control concurrency.
   wrk -t4 -c4 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
   ```

---

### 5.2 Offline Benchmark through `benchmark_vllm_throughput.py`  

1. **Download the test dataset**:  

   ```bash
   wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
   ```

The full example looks like this:
2. **Run the benchmark script**:  

   ```bash
   cd /llm/

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

   export MODEL="YOUR_MODEL"

   # You can change load-in-low-bit from values in [sym_int4, fp8, fp16]

   python3 /llm/benchmark_vllm_throughput.py \
       --backend vllm \
       --dataset /llm/ShareGPT_V3_unfiltered_cleaned_split.json \

@@ -89,3 +150,5 @@ python3 /llm/benchmark_vllm_throughput.py \
       --device cpu \
       --load-in-low-bit sym_int4
   ```

---
@@ -3,34 +3,40 @@ FROM intel/oneapi-basekit:2025.0.1-0-devel-ubuntu22.04 AS build

ARG http_proxy
ARG https_proxy

ENV TZ=Asia/Shanghai
ENV PYTHONUNBUFFERED=1

ARG PIP_NO_CACHE_DIR=false

ADD ./ccl_torch.patch /tmp/
# Set environment variables
ENV TZ=Asia/Shanghai PYTHONUNBUFFERED=1

RUN apt-get update && \
    apt-get install -y --no-install-recommends curl wget git libunwind8-dev vim less && \
    ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone && \
    env DEBIAN_FRONTEND=noninteractive apt-get update && \
    # add-apt-repository requires gnupg, gpg-agent, software-properties-common
    apt-get install -y --no-install-recommends gnupg gpg-agent software-properties-common && \
    # Add Python 3.11 PPA repository
# Copy patch file and benchmark scripts
ADD ./ccl_torch.patch /tmp/
COPY ./vllm_online_benchmark.py ./vllm_offline_inference.py ./vllm_offline_inference_vision_language.py \
     ./payload-1024.lua ./start-vllm-service.sh ./benchmark_vllm_throughput.py ./benchmark_vllm_latency.py \
     ./start-pp_serving-service.sh /llm/

RUN set -eux && \
    #
    # Update and install basic dependencies
    apt-get update && \
    apt-get install -y --no-install-recommends \
      curl wget git libunwind8-dev vim less gnupg gpg-agent software-properties-common \
      libfabric-dev wrk libaio-dev numactl && \
    #
    # Set timezone
    ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && \
    echo $TZ > /etc/timezone && \
    #
    # Install Python 3.11
    add-apt-repository ppa:deadsnakes/ppa -y && \
    apt-get install -y --no-install-recommends python3.11 git curl wget && \
    rm /usr/bin/python3 && \
    ln -s /usr/bin/python3.11 /usr/bin/python3 && \
    apt-get install -y --no-install-recommends python3.11 python3-pip python3.11-dev python3.11-distutils python3-wheel && \
    rm /usr/bin/python3 && ln -s /usr/bin/python3.11 /usr/bin/python3 && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    apt-get install -y --no-install-recommends python3-pip python3.11-dev python3-wheel python3.11-distutils && \
    #
    # Install pip and essential Python packages
    wget https://bootstrap.pypa.io/get-pip.py -O get-pip.py && \
    # Install FastChat from source requires PEP 660 support
    python3 get-pip.py && \
    rm get-pip.py && \
    pip install --upgrade requests argparse urllib3 && \
    apt-get install -y --no-install-recommends libfabric-dev wrk libaio-dev numactl && \
    # If we do not install this compute-runtime, we will fail the build later
    python3 get-pip.py && rm get-pip.py && \
    #
    # Install Intel GPU OpenCL Driver and Compute Runtime
    mkdir -p /tmp/neo && \
    cd /tmp/neo && \
    wget https://github.com/intel/intel-graphics-compiler/releases/download/v2.5.6/intel-igc-core-2_2.5.6+18417_amd64.deb && \

@@ -41,7 +47,11 @@ RUN apt-get update && \
    wget https://github.com/intel/compute-runtime/releases/download/24.52.32224.5/intel-opencl-icd_24.52.32224.5_amd64.deb && \
    wget https://github.com/intel/compute-runtime/releases/download/24.52.32224.5/libigdgmm12_22.5.5_amd64.deb && \
    dpkg -i *.deb && \
    #
    # Install Intel PyTorch extension for LLM inference
    pip install --pre --upgrade ipex-llm[xpu_2.6] --extra-index-url https://download.pytorch.org/whl/test/xpu && \
    #
    # Build torch-ccl
    mkdir /build && \
    cd /build && \
    git clone https://github.com/intel/torch-ccl.git && \

@@ -54,63 +64,77 @@ RUN apt-get update && \
    USE_SYSTEM_ONECCL=ON COMPUTE_BACKEND=dpcpp python setup.py bdist_wheel
    # File path: /build/torch-ccl/dist/oneccl_bind_pt-2.5.0+xpu-cp311-cp311-linux_x86_64.whl

# Second stage: Final runtime image
FROM intel/oneapi-basekit:2025.0.1-0-devel-ubuntu22.04

COPY --from=build /build/torch-ccl/dist/oneccl_bind_pt-2.5.0+xpu-cp311-cp311-linux_x86_64.whl /opt/oneccl_bind_pt-2.5.0+xpu-cp311-cp311-linux_x86_64.whl
# Copy the built torch-ccl package from the build stage
COPY --from=build /build/torch-ccl/dist/oneccl_bind_pt-2.5.0+xpu-cp311-cp311-linux_x86_64.whl /opt/
COPY --from=build /llm/ /llm/

ARG http_proxy
ARG https_proxy

ENV TZ=Asia/Shanghai
ENV PYTHONUNBUFFERED=1
# To prevent RPC_TIMEOUT ERROR for the first request
ENV VLLM_RPC_TIMEOUT=100000

# Disable pip's cache behavior
ARG PIP_NO_CACHE_DIR=false

RUN apt-get update && \
    apt-get install -y --no-install-recommends curl wget git libunwind8-dev vim less && \
    ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone && \
    env DEBIAN_FRONTEND=noninteractive apt-get update && \
    apt-get install -y --no-install-recommends gnupg gpg-agent software-properties-common kmod && \
    # Add Python 3.11 PPA repository
# Set environment variables
ENV TZ=Asia/Shanghai PYTHONUNBUFFERED=1 VLLM_RPC_TIMEOUT=100000

RUN set -eux && \
    #
    # Update and install basic dependencies
    apt-get update && \
    apt-get install -y --no-install-recommends \
      curl wget git libunwind8-dev vim less gnupg gpg-agent software-properties-common \
      libfabric-dev wrk libaio-dev numactl && \
    #
    # Set timezone
    ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && \
    echo $TZ > /etc/timezone && \
    #
    # Install Python 3.11
    add-apt-repository ppa:deadsnakes/ppa -y && \
    apt-get install -y --no-install-recommends python3.11 git curl wget && \
    rm /usr/bin/python3 && \
    ln -s /usr/bin/python3.11 /usr/bin/python3 && \
    apt-get install -y --no-install-recommends python3.11 python3-pip python3.11-dev python3.11-distutils python3-wheel && \
    rm /usr/bin/python3 && ln -s /usr/bin/python3.11 /usr/bin/python3 && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    apt-get install -y --no-install-recommends python3-pip python3.11-dev python3-wheel python3.11-distutils && \
    #
    # Install pip and essential Python packages
    wget https://bootstrap.pypa.io/get-pip.py -O get-pip.py && \
    python3 get-pip.py && \
    rm get-pip.py && \
    python3 get-pip.py && rm get-pip.py && \
    pip install --upgrade requests argparse urllib3 && \
    pip install --pre --upgrade ipex-llm[xpu_2.6] --extra-index-url https://download.pytorch.org/whl/test/xpu && \
    pip install transformers_stream_generator einops tiktoken && \
    pip install --upgrade colorama && \
    #
    git clone https://github.com/intel/ipex-llm.git && \
    cp -r ./ipex-llm/python/llm/dev/benchmark/ ./benchmark && \
    cp -r ./ipex-llm/python/llm/example/GPU/HuggingFace/LLM ./examples && \
    cp -r ./ipex-llm/python/llm/example/GPU/vLLM-Serving/ ./vLLM-Serving && \
    #
    # Download pp_serving
    mkdir -p /llm/pp_serving && \
    cp ./ipex-llm/python/llm/example/GPU/Pipeline-Parallel-Serving/*.py /llm/pp_serving/ && \
    #
    # Download lightweight_serving
    mkdir -p /llm/lightweight_serving && \
    cp ./ipex-llm/python/llm/example/GPU/Lightweight-Serving/*.py /llm/lightweight_serving/ && \
    rm -rf ./ipex-llm && \
    #
    # Install vllm dependencies
    pip install --upgrade fastapi && \
    pip install --upgrade "uvicorn[standard]" && \
    #
    # Install torch-ccl
    pip install /opt/oneccl_bind_pt-2.5.0+xpu-cp311-cp311-linux_x86_64.whl && \
    # install Internal oneccl
    #
    # Install Internal oneccl
    cd /opt && \
    wget https://sourceforge.net/projects/oneccl-wks/files/2025.0.0.6.6-release/oneccl_wks_installer_2025.0.0.6.6.sh && \
    bash oneccl_wks_installer_2025.0.0.6.6.sh && \
    apt-get update && \
    apt-get install -y --no-install-recommends libfabric-dev wrk libaio-dev numactl && \
    #
    # Remove breaks install packages
    apt-get remove -y libze-dev libze-intel-gpu1 && \
    #
    # Install compute runtime
    mkdir -p /tmp/neo && \
    cd /tmp/neo && \

@@ -121,10 +145,11 @@ RUN apt-get update && \
    wget https://github.com/intel/compute-runtime/releases/download/24.52.32224.5/intel-opencl-icd-dbgsym_24.52.32224.5_amd64.ddeb && \
    wget https://github.com/intel/compute-runtime/releases/download/24.52.32224.5/intel-opencl-icd_24.52.32224.5_amd64.deb && \
    wget https://github.com/intel/compute-runtime/releases/download/24.52.32224.5/libigdgmm12_22.5.5_amd64.deb && \
    dpkg -i *.deb && \
    dpkg -i *.deb && rm -rf /tmp/neo && \
    mkdir -p /llm && \
    cd /llm && \
    rm -rf /tmp/neo && \
    #
    # Install vllm
    git clone -b 0.6.6 https://github.com/analytics-zoo/vllm.git /llm/vllm && \
    cd /llm/vllm && \
| 
						 | 
				
			
			@ -135,13 +160,4 @@ RUN apt-get update && \
 | 
			
		|||
    pip install gradio==4.43.0 && \
 | 
			
		||||
    pip install ray
 | 
			
		||||
 | 
			
		||||
COPY ./vllm_online_benchmark.py                   /llm/
 | 
			
		||||
COPY ./vllm_offline_inference.py                  /llm/
 | 
			
		||||
COPY ./vllm_offline_inference_vision_language.py  /llm/
 | 
			
		||||
COPY ./payload-1024.lua                           /llm/
 | 
			
		||||
COPY ./start-vllm-service.sh                      /llm/
 | 
			
		||||
COPY ./benchmark_vllm_throughput.py               /llm/
 | 
			
		||||
COPY ./benchmark_vllm_latency.py                  /llm/
 | 
			
		||||
COPY ./start-pp_serving-service.sh                /llm/
 | 
			
		||||
 | 
			
		||||
WORKDIR /llm/
 | 
			
		||||
| 
						 | 
				
			
			
 | 
			
		|||
| 
						 | 
				
			
			@ -1,6 +1,12 @@
 | 
			
		|||
## Build/Use IPEX-LLM-serving xpu image
# IPEX-LLM-serving XPU Image: Build and Usage Guide

This document outlines the steps to build and use the `IPEX-LLM-serving-xpu` Docker image, including inference, serving, and benchmarking functionalities for XPU.
---

## 1. Build the Image

To build the `IPEX-LLM-serving-xpu` Docker image, use the following command:

### Build Image
```bash
docker build \
  --build-arg http_proxy=.. \
@ -9,13 +15,55 @@ docker build \
  --rm --no-cache -t intelanalytics/ipex-llm-serving-xpu:2.2.0-SNAPSHOT .
```

---

### Use the image for doing xpu serving
## 2. Using the Image for XPU Inference

To map the `XPU` into the container, you need to specify `--device=/dev/dri` when starting the container.

To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container.
### Example:

```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:2.2.0-SNAPSHOT

sudo docker run -itd \
        --net=host \
        --device=/dev/dri \
        --memory="32G" \
        --name=CONTAINER_NAME \
        --shm-size="16g" \
        $DOCKER_IMAGE
```

Once the container is up and running, use `docker exec` to enter it.
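
For example, assuming the container was started with `--name=CONTAINER_NAME` as above, a typical way to open a shell inside it is:

```bash
# Open an interactive shell in the running container
# (replace CONTAINER_NAME with the name you passed to `docker run`).
sudo docker exec -it CONTAINER_NAME bash
```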

To verify if the XPU device is successfully mapped into the container, run the following:

```bash
sycl-ls
```

For a machine with Arc A770, the output will be similar to:

```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```

For detailed instructions on running inference with `IPEX-LLM` on XPU, refer to this [documentation](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU).

---

## 3. Using the Image for XPU Serving

To run XPU serving, you need to map the XPU into the container by specifying `--device=/dev/dri` when booting the container.

### Example:

An example could be:
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
@ -28,68 +76,76 @@ sudo docker run -itd \
        $DOCKER_IMAGE
```

After the container starts, access it using `docker exec`.

After the container is booted, you could get into the container through `docker exec`.

To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. In a machine with Arc A770, the sampled output is:
To verify that the device is correctly mapped, run:

```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
sycl-ls
```
After the container is booted, you could get into the container through `docker exec`.

Currently, we provide two different serving engines in the image, which are FastChat serving engine and vLLM serving engine.
The output will be similar to the example in the inference section above.

Currently, the image supports two different serving engines: **FastChat** and **vLLM**.

#### Lightweight serving engine
### Serving Engines

To run Lightweight serving on one intel gpu using `IPEX-LLM` as backend, you can refer to this [readme](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Lightweight-Serving).
#### 3.1 Lightweight Serving Engine

For convenience, we have included a file `/llm/start-lightweight_serving-service` in the image. And need to install the appropriate transformers version first, like `pip install transformers==4.37.0`.
For running lightweight serving on Intel GPUs using `IPEX-LLM` as the backend, refer to the [Lightweight-Serving README](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Lightweight-Serving).

We have included a script `/llm/start-lightweight_serving-service` in the image. Make sure to install the correct `transformers` version before proceeding, like so:

#### Pipeline parallel serving engine
```bash
pip install transformers==4.37.0
```
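
With the dependency in place, launching the bundled lightweight serving helper is expected to look roughly like the sketch below; any arguments or environment variables the script may read are not documented here, so check it before use:

```bash
# Assumed invocation of the helper script shipped at /llm/start-lightweight_serving-service.
bash /llm/start-lightweight_serving-service
```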

To run Pipeline parallel serving using `IPEX-LLM` as backend, you can refer to this [readme](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Pipeline-Parallel-FastAPI).
#### 3.2 Pipeline Parallel Serving Engine

For convenience, we have included a file `/llm/start-pp_serving-service.sh` in the image. And need to install the appropriate transformers version first, like `pip install transformers==4.37.0`.
To use the **Pipeline Parallel** serving engine with `IPEX-LLM` as the backend, refer to this [Pipeline-Parallel-FastAPI README](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Pipeline-Parallel-FastAPI).

A convenience script `/llm/start-pp_serving-service.sh` is included in the image. Be sure to install the required version of `transformers`, like so:

#### vLLM serving engine
```bash
pip install transformers==4.37.0
```
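
Similarly, a minimal sketch for launching the pipeline-parallel serving helper is shown below; it assumes the script needs no extra arguments, so review it before relying on it:

```bash
# Assumed invocation of the helper script shipped at /llm/start-pp_serving-service.sh.
bash /llm/start-pp_serving-service.sh
```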

To run vLLM engine using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md).
#### 3.3 vLLM Serving Engine

We have included multiple example files in `/llm/`:
1. `vllm_offline_inference.py`: Used for vLLM offline inference example
2. `benchmark_vllm_throughput.py`: Used for benchmarking throughput
3. `payload-1024.lua`: Used for testing request per second using 1k-128 request
4. `start-vllm-service.sh`: Used for template for starting vLLM service
5. `vllm_offline_inference_vision_language.py`: Used for vLLM offline inference vision example
For running the **vLLM engine** with `IPEX-LLM` as the backend, refer to this [vLLM Docker Quickstart Guide](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md).

##### Online benchmark throurgh api_server
The following example files are available in `/llm/` within the container:

We can benchmark the api_server to get an estimation about TPS (transactions per second).  To do so, you need to start the service first according to the instructions in this [section](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md#Serving).
1. `vllm_offline_inference.py`: vLLM offline inference example
2. `benchmark_vllm_throughput.py`: Throughput benchmarking
3. `payload-1024.lua`: Request-per-second test (using 1k-128 request)
4. `start-vllm-service.sh`: Template for starting the vLLM service (see the sketch after this list)
5. `vllm_offline_inference_vision_language.py`: vLLM offline inference for vision-based models
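
As noted for `start-vllm-service.sh` above, a minimal sketch for bringing the service up from that template is shown below; the template's variables (model path, served model name, low-bit format, and so on) are assumptions you should review and edit first:

```bash
cd /llm
# Edit the template (model path, served model name, etc.) before launching,
# then start the vLLM service it defines.
bash /llm/start-vllm-service.sh
```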

###### Online benchmark through benchmark_util
---

## 4. Benchmarking

### 4.1 Online Benchmark through API Server

To benchmark the API server and estimate TPS (transactions per second), follow these steps:

1. Start the service as per the instructions in this [section](https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md#Serving).
2. Run the benchmark using `vllm_online_benchmark.py`:

After starting vllm service, Sending reqs through `vllm_online_benchmark.py`
```bash
python vllm_online_benchmark.py $model_name $max_seqs $input_length $output_length
```
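
For instance, a concrete invocation might look like the following; the model name and sequence count are illustrative values taken from the sample output further below, so substitute the ones matching your deployment:

```bash
# Benchmark a served Qwen1.5-14B-Chat model with 12 concurrent sequences,
# 1024 input tokens and 512 output tokens per request.
python vllm_online_benchmark.py Qwen1.5-14B-Chat 12 1024 512
```
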
If `input_length` and `output_length` are not provided, the script will use the default values of 1024 and 512, respectively.

And it will output like this:
If `input_length` and `output_length` are not provided, the script defaults to values of 1024 and 512 tokens, respectively. The output will look something like:

```bash
model_name: Qwen1.5-14B-Chat
max_seq: 12
Warm Up: 100%|█████████████████████████████████████████████████████| 24/24 [01:36<00:00,  4.03s/req]
Benchmarking: 100%|████████████████████████████████████████████████| 60/60 [04:03<00:00,  4.05s/req]
Total time for 60 requests with 12 concurrent requests: xxx seconds.
Average responce time: xxx
Average response time: xxx
Token throughput: xxx

Average first token latency: xxx milliseconds.
@ -101,57 +157,40 @@ P90 next token latency: xxx milliseconds.
P95 next token latency: xxx milliseconds.
```

###### Online benchmark with multimodal through benchmark_util
### 4.2 Online Benchmark with Multimodal Input

After starting the vLLM service, you can benchmark multimodal inputs using `vllm_online_benchmark_multimodal.py`:

After starting vllm service, Sending reqs through `vllm_online_benchmark_multimodal.py`
```bash
export image_url="http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg"
python vllm_online_benchmark_multimodal.py --model-name $model_name --image-url $image_url --prompt "What is in the image?" --port 8000
```

`image_url` can be `/llm/xxx.jpg` or `"http://xxx.jpg`.
The `image_url` can be a local path (e.g., `/llm/xxx.jpg`) or an external URL (e.g., `http://xxx.jpg`).

And it will output like this:
```bash
model_name: MiniCPM-V-2_6
Warm Up: 100%|███████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.68s/req]
Warm Up: 100%|███████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.42s/req]
Benchmarking: 100%|██████████████████████████████████████████████████| 3/3 [00:31<00:00, 10.43s/req]
Total time for 3 requests with 1 concurrent requests: xxx seconds.
Average responce time: xxx
Token throughput: xxx
The output will be similar to the example in the API benchmarking section.

Average first token latency: xxx milliseconds.
P90 first token latency: xxx milliseconds.
P95 first token latency: xxx milliseconds.
### 4.3 Online Benchmark through wrk

Average next token latency: xxx milliseconds.
P90 next token latency: xxx milliseconds.
P95 next token latency: xxx milliseconds.
```
In the container, modify the `payload-1024.lua` to ensure the "model" attribute is correct. By default, it uses a prompt of about 1024 tokens.

###### Online benchmark through wrk
In container, do the following:
1. modify the `/llm/payload-1024.lua` so that the "model" attribute is correct.  By default, we use a prompt that is roughly 1024 token long, you can change it if needed.
2. Start the benchmark using `wrk` using the script below:
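
If you prefer to patch the payload non-interactively, something along the following lines may work; it assumes the Lua script embeds a JSON request body with a double-quoted "model" field, so verify `/llm/payload-1024.lua` before relying on it:

```bash
# Replace the served model name in the wrk payload script (assumed JSON body layout).
sed -i 's/"model": "[^"]*"/"model": "YOUR_MODEL"/' /llm/payload-1024.lua
```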

Then, start the benchmark using `wrk`:

```bash
cd /llm
# You can change -t and -c to control the concurrency.
# By default, we use 12 connections to benchmark the service.
wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
```

#### Offline benchmark through benchmark_vllm_throughput.py
### 4.4 Offline Benchmark through `benchmark_vllm_throughput.py`

We have included the benchmark_throughput script provied by `vllm` in our image as `/llm/benchmark_vllm_throughput.py`.  To use the benchmark_throughput script, you will need to download the test dataset through:
To use the `benchmark_vllm_throughput.py` script, first download the test dataset:

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

The full example looks like this:
Then, run the benchmark:

```bash
cd /llm/

@ -159,8 +198,6 @@ wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/r

export MODEL="YOUR_MODEL"

# You can change load-in-low-bit from values in [sym_int4, fp8, fp16]

python3 /llm/benchmark_vllm_throughput.py \
    --backend vllm \
    --dataset /llm/ShareGPT_V3_unfiltered_cleaned_split.json \
@ -175,57 +212,4 @@ python3 /llm/benchmark_vllm_throughput.py \
    --gpu-memory-utilization 0.85
```

> Note: you can adjust --load-in-low-bit to use other formats of low-bit quantization.

You can also adjust `--gpu-memory-utilization` rate using the below script to find the best performance using the following script:

```bash
#!/bin/bash

# Define the log directory
LOG_DIR="YOUR_LOG_DIR"
# Check if the log directory exists, if not, create it
if [ ! -d "$LOG_DIR" ]; then
    mkdir -p "$LOG_DIR"
fi

# Define an array of model paths
MODELS=(
    "YOUR TESTED MODELS"
)

# Define an array of utilization rates
UTIL_RATES=(0.85 0.90 0.95)

# Loop over each model
for MODEL in "${MODELS[@]}"; do
    # Loop over each utilization rate
    for RATE in "${UTIL_RATES[@]}"; do
        # Extract a simple model name from the path for easier identification
        MODEL_NAME=$(basename "$MODEL")

        # Define the log file name based on the model and rate
        LOG_FILE="$LOG_DIR/${MODEL_NAME}_utilization_${RATE}.log"

        # Execute the command and redirect output to the log file
        # Sometimes you might need to set --max-model-len if memory is not enough
        # load-in-low-bit accepts inputs [sym_int4, fp8, fp16]
        python3 /llm/benchmark_vllm_throughput.py \
            --backend vllm \
            --dataset /llm/ShareGPT_V3_unfiltered_cleaned_split.json \
            --model $MODEL \
            --num-prompts 1000 \
            --seed 42 \
            --trust-remote-code \
            --enforce-eager \
            --dtype float16 \
            --load-in-low-bit sym_int4 \
            --device xpu \
            --gpu-memory-utilization $RATE &> "$LOG_FILE"
    done
done

# Inform the user that the script has completed its execution
echo "All benchmarks have been executed and logged."
```
---

@ -11,9 +11,9 @@ Follow the [Docker installation Guide](./docker_windows_gpu.md#install-docker) t

## Launch Docker

Prepare ipex-llm-xpu Docker Image:
Prepare ipex-llm-serving-xpu Docker Image:
```bash
docker pull intelanalytics/ipex-llm-xpu:latest
docker pull intelanalytics/ipex-llm-serving-xpu:latest
```

Start ipex-llm-xpu Docker Container. Choose one of the following commands to start the container:
@ -21,7 +21,7 @@ Start ipex-llm-xpu Docker Container. Choose one of the following commands to sta
- For **Linux users**:

  ```bash
  export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
  export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
  export CONTAINER_NAME=my_container
  export MODEL_PATH=/llm/models[change to your model path]

@ -39,7 +39,7 @@ Start ipex-llm-xpu Docker Container. Choose one of the following commands to sta

  ```bash
  #!/bin/bash
  export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
  export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
  export CONTAINER_NAME=my_container
  export MODEL_PATH=/llm/models[change to your model path]

@ -46,16 +46,16 @@ Press F1 to bring up the Command Palette and type in `WSL: Connect to WSL Usin
Open the Terminal in VSCode (you can use the shortcut `` Ctrl+Shift+` ``), then pull ipex-llm-xpu Docker Image:

```bash
docker pull intelanalytics/ipex-llm-xpu:latest
docker pull intelanalytics/ipex-llm-serving-xpu:latest
```

Start ipex-llm-xpu Docker Container. Choose one of the following commands to start the container:
Start ipex-llm-serving-xpu Docker Container. Choose one of the following commands to start the container:

- For **Linux users**:

  ```bash

  export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
  export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
  export CONTAINER_NAME=my_container
  export MODEL_PATH=/llm/models[change to your model path]

@ -73,7 +73,7 @@ Start ipex-llm-xpu Docker Container. Choose one of the following commands to sta

  ```bash
  #!/bin/bash
  export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
  export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
  export CONTAINER_NAME=my_container
  export MODEL_PATH=/llm/models[change to your model path]

@ -69,10 +69,9 @@ We have several docker images available for running LLMs on Intel GPUs. The foll

| Image Name | Description | Use Case |
|------------|-------------|----------|
| intelanalytics/ipex-llm-cpu:latest | CPU Inference |For development and running LLMs using llama.cpp, Ollama and Python|
| intelanalytics/ipex-llm-xpu:latest | GPU Inference |For development and running LLMs using llama.cpp, Ollama and Python|
| intelanalytics/ipex-llm-serving-cpu:latest | CPU Serving|For serving multiple users/requests through REST APIs using vLLM/FastChat|
| intelanalytics/ipex-llm-serving-xpu:latest | GPU Serving|For serving multiple users/requests through REST APIs using vLLM/FastChat|
| intelanalytics/ipex-llm-inference-cpp-xpu:latest | Run llama.cpp/Ollama/Open-WebUI on GPU via Docker|
| intelanalytics/ipex-llm-serving-cpu:latest | CPU Inference & Serving|For inference or serving multiple users/requests through REST APIs using vLLM/FastChat|
| intelanalytics/ipex-llm-serving-xpu:latest | GPU Inference & Serving|For inference or serving multiple users/requests through REST APIs using vLLM/FastChat|
| intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:latest | CPU Finetuning via Docker|For fine-tuning LLMs using QLora/Lora, etc. |
|intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:latest|CPU Finetuning via Kubernetes|For fine-tuning LLMs using QLora/Lora, etc. |
| intelanalytics/ipex-llm-finetune-qlora-xpu:latest| GPU Finetuning|For fine-tuning LLMs using QLora/Lora, etc.|