From c1ec3d89216d2d56510b57d0e93998f7d23d4b95 Mon Sep 17 00:00:00 2001
From: binbin Deng <108676127+plusbang@users.noreply.github.com>
Date: Wed, 7 Feb 2024 15:02:24 +0800
Subject: [PATCH] LLM: update FAQ about too many open files (#10119)

---
 .../source/doc/LLM/Overview/FAQ/resolve_error.md       |  8 ++++++++
 python/llm/example/GPU/LLM-Finetuning/LoRA/README.md   |  6 +-----
 .../llm/example/GPU/LLM-Finetuning/QA-LoRA/README.md   |  6 +-----
 .../GPU/LLM-Finetuning/QLoRA/alpaca-qlora/README.md    |  6 +-----
 python/llm/example/GPU/LLM-Finetuning/README.md        | 10 ++++++++++
 python/llm/example/GPU/LLM-Finetuning/ReLora/README.md |  6 +-----
 6 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/docs/readthedocs/source/doc/LLM/Overview/FAQ/resolve_error.md b/docs/readthedocs/source/doc/LLM/Overview/FAQ/resolve_error.md
index bf59b450..8d1fc312 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/FAQ/resolve_error.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/FAQ/resolve_error.md
@@ -53,3 +53,11 @@ This error is caused by out of GPU memory. Some possible solutions to decrease G
 ### failed to enable AMX
 
 You could use `export BIGDL_LLM_AMX_DISABLED=1` to disable AMX manually and solve this error.
+
+### oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
+
+You may encounter this error during finetuning on multiple GPUs. Please try `sudo apt install level-zero-dev` to fix it.
+
+### Too many open files
+
+You may encounter this error during finetuning, especially when running a 70B model. Please raise the system open file limit using `ulimit -n 1048576`.
diff --git a/python/llm/example/GPU/LLM-Finetuning/LoRA/README.md b/python/llm/example/GPU/LLM-Finetuning/LoRA/README.md
index 6671aca1..73740a11 100644
--- a/python/llm/example/GPU/LLM-Finetuning/LoRA/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/LoRA/README.md
@@ -83,8 +83,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 
 ### 7. Troubleshooting
-- If you fail to finetune on multi cards because of following error message:
-  ```bash
-  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
-  ```
-  Please try `sudo apt install level-zero-dev` to fix it.
+Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.

diff --git a/python/llm/example/GPU/LLM-Finetuning/QA-LoRA/README.md b/python/llm/example/GPU/LLM-Finetuning/QA-LoRA/README.md
index f2579f9e..9b237298 100644
--- a/python/llm/example/GPU/LLM-Finetuning/QA-LoRA/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/QA-LoRA/README.md
@@ -77,8 +77,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 
 ### 7. Troubleshooting
-- If you fail to finetune on multi cards because of following error message:
-  ```bash
-  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
-  ```
-  Please try `sudo apt install level-zero-dev` to fix it.
+Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
diff --git a/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora/README.md b/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora/README.md
index 18b9729e..afa0dc37 100644
--- a/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora/README.md
@@ -160,8 +160,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 
 ### 7. Troubleshooting
-- If you fail to finetune on multi cards because of following error message:
-  ```bash
-  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
-  ```
-  Please try `sudo apt install level-zero-dev` to fix it.
+Please refer to [here](../../README.md#troubleshooting) for solutions to common issues during finetuning.

diff --git a/python/llm/example/GPU/LLM-Finetuning/README.md b/python/llm/example/GPU/LLM-Finetuning/README.md
index 114a73cb..8f667550 100644
--- a/python/llm/example/GPU/LLM-Finetuning/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/README.md
@@ -8,3 +8,13 @@ This folder contains examples of running different training mode with BigDL-LLM
 - [ReLora](ReLora): examples of running ReLora finetuning
 - [DPO](DPO): examples of running DPO finetuning
 - [common](common): common templates and utility classes in finetuning examples
+
+
+## Troubleshooting
+- If you fail to finetune on multiple GPUs because of the following error message:
+  ```bash
+  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
+  ```
+  Please try `sudo apt install level-zero-dev` to fix it.
+
+- Please raise the system open file limit using `ulimit -n 1048576`. Otherwise, you may encounter the error `Too many open files`.
diff --git a/python/llm/example/GPU/LLM-Finetuning/ReLora/README.md b/python/llm/example/GPU/LLM-Finetuning/ReLora/README.md
index 112eae08..084a6ef7 100644
--- a/python/llm/example/GPU/LLM-Finetuning/ReLora/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/ReLora/README.md
@@ -83,8 +83,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 
 ### 7. Troubleshooting
-- If you fail to finetune on multi cards because of following error message:
-  ```bash
-  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
-  ```
-  Please try `sudo apt install level-zero-dev` to fix it.
+Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
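For anyone applying this patch, the `ulimit` advice it adds can be verified in a shell session before launching a finetuning run. A minimal sketch — only the `ulimit -n 1048576` value comes from the patch itself; the surrounding commands are standard shell builtins:

```shell
# Show the current soft limit on open file descriptors.
ulimit -n

# Show the hard limit; the soft limit can only be raised up to this value
# without root privileges.
ulimit -Hn

# Raise the soft limit for this shell session to the value suggested in
# the patched FAQ (fails if it exceeds the hard limit shown above).
ulimit -n 1048576

# Confirm the new limit before starting the finetuning run.
ulimit -n
```

Note that `ulimit` only affects the current shell and its children; to make the limit persistent across sessions, the system-wide limits configuration (e.g. `/etc/security/limits.conf` on typical Linux distributions) would need to be changed instead.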