# IPEX-LLM Docker Containers
You can run IPEX-LLM containers (via Docker or Kubernetes) for inference, serving, and fine-tuning on Intel CPU and GPU. Details on how to use these containers are available in the IPEX-LLM Docker Container Guides.
## Prerequisites
- Docker on Windows or Linux
  - Windows Subsystem for Linux (WSL) is required if using Windows.
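As an optional sanity check before pulling any images, you can confirm that Docker (and WSL on Windows) is set up; the commands below are a minimal sketch and not part of the container guides:

```bash
# Verify the Docker CLI is installed and the daemon is reachable
docker --version
docker info

# On Windows, confirm WSL is installed and a distribution is available
wsl --list --verbose
```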
## Quick Start
### Pull an IPEX-LLM Docker Image
To pull IPEX-LLM Docker images from Docker Hub, use the `docker pull` command. For instance, to pull the CPU inference image:

```bash
docker pull intelanalytics/ipex-llm-cpu:2.2.0-SNAPSHOT
```
The available images on Docker Hub are:
| Image Name | Description | 
|---|---|
| intelanalytics/ipex-llm-cpu:2.2.0-SNAPSHOT | CPU Inference | 
| intelanalytics/ipex-llm-xpu:2.2.0-SNAPSHOT | GPU Inference | 
| intelanalytics/ipex-llm-serving-cpu:2.2.0-SNAPSHOT | CPU Serving | 
| intelanalytics/ipex-llm-serving-xpu:2.2.0-SNAPSHOT | GPU Serving | 
| intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.2.0-SNAPSHOT | CPU Finetuning via Docker | 
| intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:2.2.0-SNAPSHOT | CPU Finetuning via Kubernetes | 
| intelanalytics/ipex-llm-finetune-qlora-xpu:2.2.0-SNAPSHOT | GPU Finetuning | 
### Run a Container
Use the `docker run` command to start an IPEX-LLM Docker container. For detailed instructions, refer to the IPEX-LLM Docker Container Guides.
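As a rough sketch only (the container name and model mount path below are placeholders; the exact flags for each image are in the guides), starting the CPU inference container might look like this:

```bash
# Start a CPU inference container in the background
# (ipex-llm-cpu-demo and /path/to/models are placeholder values)
docker run -itd \
  --name ipex-llm-cpu-demo \
  --net=host \
  -v /path/to/models:/models \
  intelanalytics/ipex-llm-cpu:2.2.0-SNAPSHOT

# Open an interactive shell inside the running container
docker exec -it ipex-llm-cpu-demo bash
```

GPU images additionally require mapping the Intel GPU device into the container (for example with `--device=/dev/dri`); see the guides for the exact flags for each image.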
### Build Docker Image
To build a Docker image from source, first clone the IPEX-LLM repository and navigate to the directory containing the desired Dockerfile. For example, to build the CPU inference image, navigate to `docker/llm/inference/cpu/docker`.
Then, use the following command to build the image (replace `your_image_name` with your desired image name):

```bash
docker build \
  --build-arg no_proxy=localhost,127.0.0.1 \
  --rm --no-cache -t your_image_name .
```
Note: If you're working behind a proxy, also add the build args `--build-arg http_proxy=http://your_proxy_url:port` and `--build-arg https_proxy=https://your_proxy_url:port`, as in the sketch below.
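For reference, a full invocation behind a proxy might look like the following; the proxy URL, port, and image name are placeholders to replace with your own values:

```bash
# Build behind a proxy (placeholders: proxy URL/port and image name)
docker build \
  --build-arg http_proxy=http://your_proxy_url:port \
  --build-arg https_proxy=https://your_proxy_url:port \
  --build-arg no_proxy=localhost,127.0.0.1 \
  --rm --no-cache -t your_image_name .
```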