From c1ec3d89216d2d56510b57d0e93998f7d23d4b95 Mon Sep 17 00:00:00 2001
From: binbin Deng <108676127+plusbang@users.noreply.github.com>
Date: Wed, 7 Feb 2024 15:02:24 +0800
Subject: [PATCH] LLM: update FAQ about too many open files (#10119)

---
 .../source/doc/LLM/Overview/FAQ/resolve_error.md       |  8 ++++++++
 python/llm/example/GPU/LLM-Finetuning/LoRA/README.md   |  6 +-----
 .../llm/example/GPU/LLM-Finetuning/QA-LoRA/README.md   |  6 +-----
 .../GPU/LLM-Finetuning/QLoRA/alpaca-qlora/README.md    |  6 +-----
 python/llm/example/GPU/LLM-Finetuning/README.md        | 10 ++++++++++
 python/llm/example/GPU/LLM-Finetuning/ReLora/README.md |  6 +-----
 6 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/docs/readthedocs/source/doc/LLM/Overview/FAQ/resolve_error.md b/docs/readthedocs/source/doc/LLM/Overview/FAQ/resolve_error.md
index bf59b450..8d1fc312 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/FAQ/resolve_error.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/FAQ/resolve_error.md
@@ -53,3 +53,11 @@ This error is caused by out of GPU memory. Some possible solutions to decrease G
 ### failed to enable AMX
 
 You could use `export BIGDL_LLM_AMX_DISABLED=1` to disable AMX manually and solve this error.
+
+### oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
+
+You may encounter this error during finetuning on multiple GPUs. Please try `sudo apt install level-zero-dev` to fix it.
+
+### Too many open files
+
+You may encounter this error during finetuning, especially when running a 70B model. Please raise the system open file limit using `ulimit -n 1048576`.
diff --git a/python/llm/example/GPU/LLM-Finetuning/LoRA/README.md b/python/llm/example/GPU/LLM-Finetuning/LoRA/README.md
index 6671aca1..73740a11 100644
--- a/python/llm/example/GPU/LLM-Finetuning/LoRA/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/LoRA/README.md
@@ -83,8 +83,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 
 ### 7. Troubleshooting
-- If you fail to finetune on multi cards because of following error message:
-  ```bash
-  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
-  ```
-  Please try `sudo apt install level-zero-dev` to fix it.
+Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.

diff --git a/python/llm/example/GPU/LLM-Finetuning/QA-LoRA/README.md b/python/llm/example/GPU/LLM-Finetuning/QA-LoRA/README.md
index f2579f9e..9b237298 100644
--- a/python/llm/example/GPU/LLM-Finetuning/QA-LoRA/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/QA-LoRA/README.md
@@ -77,8 +77,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 
 ### 7. Troubleshooting
-- If you fail to finetune on multi cards because of following error message:
-  ```bash
-  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
-  ```
-  Please try `sudo apt install level-zero-dev` to fix it.
+Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
diff --git a/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora/README.md b/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora/README.md
index 18b9729e..afa0dc37 100644
--- a/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora/README.md
@@ -160,8 +160,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 
 ### 7. Troubleshooting
-- If you fail to finetune on multi cards because of following error message:
-  ```bash
-  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
-  ```
-  Please try `sudo apt install level-zero-dev` to fix it.
+Please refer to [here](../../README.md#troubleshooting) for solutions to common issues during finetuning.

diff --git a/python/llm/example/GPU/LLM-Finetuning/README.md b/python/llm/example/GPU/LLM-Finetuning/README.md
index 114a73cb..8f667550 100644
--- a/python/llm/example/GPU/LLM-Finetuning/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/README.md
@@ -8,3 +8,13 @@ This folder contains examples of running different training mode with BigDL-LLM
 - [ReLora](ReLora): examples of running ReLora finetuning
 - [DPO](DPO): examples of running DPO finetuning
 - [common](common): common templates and utility classes in finetuning examples
+
+
+## Troubleshooting
+- If you fail to finetune on multiple GPUs because of the following error message:
+  ```bash
+  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
+  ```
+  Please try `sudo apt install level-zero-dev` to fix it.
+
+- Please raise the system open file limit using `ulimit -n 1048576`. Otherwise, you may encounter the error `Too many open files`.
diff --git a/python/llm/example/GPU/LLM-Finetuning/ReLora/README.md b/python/llm/example/GPU/LLM-Finetuning/ReLora/README.md
index 112eae08..084a6ef7 100644
--- a/python/llm/example/GPU/LLM-Finetuning/ReLora/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/ReLora/README.md
@@ -83,8 +83,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 
 ### 7. Troubleshooting
-- If you fail to finetune on multi cards because of following error message:
-  ```bash
-  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
-  ```
-  Please try `sudo apt install level-zero-dev` to fix it.
+Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
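For anyone applying this patch, the `ulimit` advice it adds can be verified in a shell session before launching a finetuning run. A minimal sketch — only the `ulimit -n 1048576` value comes from the patch itself; the surrounding commands are standard shell builtins:

```shell
# Show the current soft limit on open file descriptors.
ulimit -n

# Show the hard limit; the soft limit can only be raised up to this value
# without root privileges.
ulimit -Hn

# Raise the soft limit for this shell session to the value suggested in
# the patched FAQ (fails if it exceeds the hard limit shown above).
ulimit -n 1048576

# Confirm the new limit before starting the finetuning run.
ulimit -n
```

Note that `ulimit` only affects the current shell and its children; to make the limit persistent across sessions, the system-wide limits configuration (e.g. `/etc/security/limits.conf` on typical Linux distributions) would need to be changed instead.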