# IPEX-LLM-Serving CPU Image: Build and Usage Guide
This document provides instructions for building and using the IPEX-LLM-serving CPU Docker image, including model inference, serving, and benchmarking functionalities.
## 1. Build the Image
To build the `ipex-llm-serving-cpu` Docker image, run the following command:

```bash
docker build \
  --build-arg http_proxy=.. \
  --build-arg https_proxy=.. \
  --build-arg no_proxy=.. \
  --rm --no-cache -t intelanalytics/ipex-llm-serving-cpu:latest .
```
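After the build finishes, a quick way to confirm that the image is available locally (purely a sanity check, not part of the build itself):

```bash
# List local images and filter for the tag built above
docker images | grep ipex-llm-serving-cpu
```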
## 2. Run the Container
Before running `chat.py` or using the serving functionality, start the container by following the steps below.
### Step 1: Download the Model (Optional)
If using a local model, download it to your host machine and bind the directory to the container when launching it:

```bash
export MODEL_PATH=/home/llm/models  # Change this to your model directory
```

This ensures the container has access to the necessary models.
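If you still need to obtain the weights, one possible route is the Hugging Face CLI. This is only a sketch: `meta-llama/Llama-2-7b-chat-hf` is a placeholder repo id, and it assumes the `huggingface_hub` package is installed on the host.

```bash
# Install the CLI on the host (skip if already available)
pip install -U huggingface_hub

# Download an example model into the directory that will be mounted into the container
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
  --local-dir $MODEL_PATH/Llama-2-7b-chat-hf
```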
### Step 2: Start the Container
Use the following command to start the container:

```bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:latest

sudo docker run -itd \
  --net=host \
  --cpuset-cpus="0-47" \
  --cpuset-mems="0" \
  --memory="32G" \
  --shm-size="16g" \
  --name=CONTAINER_NAME \
  -v $MODEL_PATH:/llm/models/ \
  $DOCKER_IMAGE
```

Notes on the flags:

- `--net=host`: use host networking for performance.
- `--cpuset-cpus="0-47"`: limit the container to specific CPU cores.
- `--cpuset-mems="0"`: bind the container to NUMA node 0 for memory locality.
- `--memory="32G"`: limit memory usage to 32 GB.
- `--shm-size="16g"`: set shared memory size to 16 GB (useful for large models).
- `--name=CONTAINER_NAME`: replace `CONTAINER_NAME` with a name of your choice.
- `-v $MODEL_PATH:/llm/models/`: mount the model directory.
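Before moving on, you can check that the container is up; `CONTAINER_NAME` is the same placeholder used in the command above.

```bash
# Show the container if it is running
sudo docker ps --filter "name=CONTAINER_NAME"
```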
### Step 3: Access the Running Container
Once the container is started, you can access it using:

```bash
sudo docker exec -it CONTAINER_NAME bash
```
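As a quick sanity check inside the container, you can verify that the IPEX-LLM Python package loads. This assumes the image exposes it as the `ipex_llm` module in the default Python environment.

```bash
# Run inside the container
python -c "import ipex_llm; print('ipex_llm imported successfully')"
```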
## 3. Using `chat.py` for Inference

The `chat.py` script is used for model inference. It is located under the `/llm` directory inside the container.
### Steps:

- Run `chat.py` for inference inside the container:

  ```bash
  cd /llm
  python chat.py --model-path /llm/models/MODEL_NAME
  ```

  Replace `MODEL_NAME` with the name of your model.
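For example, if the weights were downloaded and mounted as described in Step 1, the invocation might look like this (the `Llama-2-7b-chat-hf` folder name is only an illustration and must match what you placed under `$MODEL_PATH`):

```bash
cd /llm
# /llm/models/ is the mount point of $MODEL_PATH inside the container
python chat.py --model-path /llm/models/Llama-2-7b-chat-hf
```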
## 4. Serving with IPEX-LLM
The container supports multiple serving engines.
### 4.1 Serving with FastChat Engine
To run FastChat serving using IPEX-LLM as the backend, refer to this document.
### 4.2 Serving with vLLM Engine
To use vLLM with IPEX-LLM as the backend, refer to the vLLM Serving Guide.
The following example files are included in the `/llm/` directory inside the container:

- `vllm_offline_inference.py`: vLLM offline inference example.
- `benchmark_vllm_throughput.py`: used for throughput benchmarking.
- `payload-1024.lua`: used for testing requests per second with a 1k-128 request pattern.
- `start-vllm-service.sh`: template script for starting the vLLM service.
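Once the service is running (for example, via `start-vllm-service.sh`), you can send a test request to the OpenAI-compatible completions endpoint. The sketch below assumes the server listens on port 8000 (as in the benchmark section) and that `YOUR_MODEL` matches the model name the service was started with.

```bash
# Minimal smoke test against the vLLM OpenAI-compatible API
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "YOUR_MODEL",
        "prompt": "What is IPEX-LLM?",
        "max_tokens": 64
      }'
```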
## 5. Benchmarks
### 5.1 Online Benchmark through API Server
To benchmark the API Server and estimate transactions per second (TPS), first start the service as per the instructions in the vLLM Serving Guide.
Then, follow these steps:

- Modify the `payload-1024.lua` file to ensure the `"model"` attribute is correctly set (see the sketch after this list).
- Run the benchmark using `wrk`:

  ```bash
  cd /llm
  # You can adjust -t and -c to control concurrency.
  wrk -t4 -c4 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
  ```
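One way to update the `"model"` attribute mentioned in the first step is a `sed` one-liner. This is only a sketch: it assumes `payload-1024.lua` embeds the request body as a JSON string containing a `"model"` field, so inspect the file before relying on it.

```bash
# Set the "model" field in the embedded JSON body to your model name
sed -i 's/"model": *"[^"]*"/"model": "YOUR_MODEL"/' /llm/payload-1024.lua
```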
### 5.2 Offline Benchmark through `benchmark_vllm_throughput.py`
- Download the test dataset:

  ```bash
  wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
  ```

- Run the benchmark script:

  ```bash
  cd /llm/
  export MODEL="YOUR_MODEL"
  # You can change load-in-low-bit from values in [sym_int4, fp8, fp16]
  python3 /llm/benchmark_vllm_throughput.py \
    --backend vllm \
    --dataset /llm/ShareGPT_V3_unfiltered_cleaned_split.json \
    --model $MODEL \
    --num-prompts 1000 \
    --seed 42 \
    --trust-remote-code \
    --enforce-eager \
    --dtype bfloat16 \
    --device cpu \
    --load-in-low-bit sym_int4
  ```