[Docker] Fix image using two cards error (#11144)

* fix all

* done
Guancheng Fu 2024-05-27 16:20:13 +08:00 committed by GitHub
parent 34dab3b4ef
commit daf7b1cd56
3 changed files with 29 additions and 3 deletions

View file

@@ -44,7 +44,7 @@ RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRO
pip install transformers_stream_generator einops tiktoken && \
# Install opencl-related repos
apt-get update && \
-apt-get install -y intel-opencl-icd intel-level-zero-gpu=1.3.26241.33-647~22.04 level-zero level-zero-dev --allow-downgrades && \
+apt-get install -y intel-opencl-icd intel-level-zero-gpu level-zero && \
# Install related library of chat.py
pip install --upgrade colorama && \
# Download all-in-one benchmark and examples
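
If you want to sanity-check the result of this change, one option (an illustrative check, not part of the diff) is to list the installed packages inside the built image:

```bash
# Confirm the now-unpinned level-zero packages made it into the image
dpkg -l | grep -E 'intel-level-zero-gpu|level-zero'
```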

View file

@@ -114,9 +114,32 @@ python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MO
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu
+python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu --load-in-low-bit "sym_int4" --enforce-eager
```
#### Launch multiple workers
Sometimes we may want to start multiple workers for the best performance. When running on CPU, you may want to place the workers on different sockets. Assuming each socket has 48 physical cores, you can start two workers as in the following example:
```bash
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "cpu" &
# All the workers other than the first worker need to specify a different worker port and corresponding worker-address
numactl -C 48-95 -m 1 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "cpu" --port 21003 --worker-address "http://localhost:21003" &
```
For GPU, we may want to start two workers on different GPUs. To achieve this, use the `ZE_AFFINITY_MASK` environment variable to select a different GPU for each worker. Below is an example:
```bash
ZE_AFFINITY_MASK=1 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu" &
# All the workers other than the first worker need to specify a different worker port and corresponding worker-address
ZE_AFFINITY_MASK=2 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu" --port 21003 --worker-address "http://localhost:21003" &
```
If you are not sure about the effect of `ZE_AFFINITY_MASK`, you can set it and check the output of `sycl-ls`.
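For example, a quick check could look like this (assuming the oneAPI environment has been sourced so that `sycl-ls` is on the `PATH`):
```bash
# Without the mask, sycl-ls lists all visible devices
sycl-ls
# With the mask set, only the selected GPU should appear
ZE_AFFINITY_MASK=1 sycl-ls
```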
### Launch Gradio web server
When you have started the controller and the worker, you can start the web server as follows:
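
For reference, FastChat's Gradio web server is typically launched like the sketch below; the exact flags used in these docs are not shown in this hunk, so treat it as an illustrative example only.

```bash
# Connects to the FastChat controller at http://localhost:21001 by default
python3 -m fastchat.serve.gradio_web_server
```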

View file

@@ -41,6 +41,9 @@ from fastchat.serve.model_worker import (
worker_id,
)
from fastchat.utils import get_context_length, is_partial_stop
from typing import TYPE_CHECKING
if TYPE_CHECKING:
    from ipex_llm.vllm.cpu.engine import IPEXLLMAsyncLLMEngine as AsyncLLMEngine
app = FastAPI()
@@ -56,7 +59,7 @@ class VLLMWorker(BaseModelWorker):
model_names: List[str],
limit_worker_concurrency: int,
no_register: bool,
-llm_engine: AsyncLLMEngine,
+llm_engine: 'AsyncLLMEngine',
conv_template: str,
):
super().__init__(