LLM: update FAQ about too many open files (#10119)

2024-02-07 15:02:24 +08:00
commit c1ec3d8921
parent 2e80701f58
6 changed files with 22 additions and 20 deletions
--- a/docs/readthedocs/source/doc/LLM/Overview/FAQ/resolve_error.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/FAQ/resolve_error.md
@@ -53,3 +53,11 @@ This error is caused by out of GPU memory. Some possible solutions to decrease G
 ### failed to enable AMX
 You could use `export BIGDL_LLM_AMX_DISABLED=1` to disable AMX manually and solve this error.
 ### oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
 You may encounter this error during finetuning on multi GPUs. Please try `sudo apt install level-zero-dev` to fix it.
 ### Too many open files
 You may encounter this error during finetuning, especially when running a 70B model. Please raise the system open file limit using `ulimit -n 1048576`.
--- a/python/llm/example/GPU/LLM-Finetuning/LoRA/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/LoRA/README.md
@@ -83,8 +83,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 ### 7. Troubleshooting
- If you fail to finetune on multiple cards because of the following error message:
+Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
  ```bash
  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
  ```
  Please try `sudo apt install level-zero-dev` to fix it.
--- a/python/llm/example/GPU/LLM-Finetuning/QA-LoRA/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/QA-LoRA/README.md
@@ -77,8 +77,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 ### 7. Troubleshooting
- If you fail to finetune on multiple cards because of the following error message:
+Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
  ```bash
  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
  ```
  Please try `sudo apt install level-zero-dev` to fix it.
--- a/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora/README.md
@@ -160,8 +160,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 ### 7. Troubleshooting
- If you fail to finetune on multiple cards because of the following error message:
+Please refer to [here](../../README.md#troubleshooting) for solutions to common issues during finetuning.
  ```bash
  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
  ```
  Please try `sudo apt install level-zero-dev` to fix it.
--- a/python/llm/example/GPU/LLM-Finetuning/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/README.md
@@ -8,3 +8,13 @@ This folder contains examples of running different training mode with BigDL-LLM
 - [ReLora](ReLora): examples of running ReLora finetuning
 - [DPO](DPO): examples of running DPO finetuning
 - [common](common): common templates and utility classes in finetuning examples
 ## Troubleshooting
 - If you fail to finetune on multiple cards because of the following error message:
  ```bash
  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
  ```
  Please try `sudo apt install level-zero-dev` to fix it.
 - Please raise the system open file limit using `ulimit -n 1048576`. Otherwise, you may encounter a `Too many open files` error.
--- a/python/llm/example/GPU/LLM-Finetuning/ReLora/README.md
+++ b/python/llm/example/GPU/LLM-Finetuning/ReLora/README.md
@@ -83,8 +83,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 ### 7. Troubleshooting
- If you fail to finetune on multiple cards because of the following error message:
+Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
  ```bash
  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
  ```
  Please try `sudo apt install level-zero-dev` to fix it.
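
Note on the `Too many open files` entry: `ulimit -n 1048576` raises the limit for the current shell session only, and it fails if the hard limit is lower than the requested value. The sketch below is not part of this commit or of BigDL-LLM. It shows one way a finetuning script could inspect and raise its own limit with Python's standard `resource` module (Unix only). The helper name `raise_open_file_limit` is hypothetical, and the default of 1048576 is simply the value recommended above.

```python
import resource
from typing import Tuple


def raise_open_file_limit(desired: int = 1048576) -> Tuple[int, int]:
    """Raise the soft RLIMIT_NOFILE toward `desired` (hypothetical helper).

    Returns the (soft, hard) limits in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process can raise its soft limit only up to the hard limit.
    if hard == resource.RLIM_INFINITY:
        target = desired
    else:
        target = min(desired, hard)

    # Nothing to do if the soft limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # Some platforms reject values above a kernel-wide cap; keep the old limit.
        pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = raise_open_file_limit()
    print(f"open file limit: soft={soft}, hard={hard}")
```

If the reported soft limit stays below 1048576, the hard limit is the constraint, and it has to be raised by an administrator before either this helper or `ulimit -n` can take effect.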