From 02ec313eab6adfc02602b75eaf544a1cee78226a Mon Sep 17 00:00:00 2001
From: Guancheng Fu <110874468+gc-fu@users.noreply.github.com>
Date: Mon, 24 Feb 2025 09:59:17 +0800
Subject: [PATCH] Update README.md (#12877)

---
 python/llm/example/GPU/vLLM-Serving/README.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/python/llm/example/GPU/vLLM-Serving/README.md b/python/llm/example/GPU/vLLM-Serving/README.md
index 1ef70a39..7c857686 100644
--- a/python/llm/example/GPU/vLLM-Serving/README.md
+++ b/python/llm/example/GPU/vLLM-Serving/README.md
@@ -241,3 +241,10 @@ llm = LLM(model="DeepSeek-R1-Distill-Qwen-7B", # Unquantized model path on disk
 When finish executing, the low-bit model has been saved at `/llm/fp8-model-path`.
 
 Later we can use the option `--low-bit-model-path /llm/fp8-model-path` to use the low-bit model.
+
+
+### 5. Known issues
+
+#### Runtime memory
+
+If runtime memory is a concern, you can set `--swap-space 0.5` to reduce memory consumption during execution. The default value of `--swap-space` is 4, meaning that by default vLLM reserves 4 GiB of CPU memory per GPU as swap space for use when GPU memory is insufficient.
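
For readers using vLLM's offline Python API rather than the command-line server, the same setting is exposed as the `swap_space` argument of `vllm.LLM` (in GiB). Below is a minimal sketch of lowering it to 0.5 GiB; the model path is a placeholder taken from the README's earlier example, and the snippet assumes the upstream vLLM API rather than anything specific to this patch:

```python
from vllm import LLM, SamplingParams

# Equivalent of passing `--swap-space 0.5` to the server: reserve only
# 0.5 GiB of CPU memory per GPU as swap space instead of the 4 GiB default.
llm = LLM(
    model="DeepSeek-R1-Distill-Qwen-7B",  # placeholder local model path
    swap_space=0.5,
)

# Quick smoke test to confirm the engine starts with the reduced swap space.
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["What is AI?"], params)
for output in outputs:
    print(output.outputs[0].text)
```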