LLM: update FAQ about too many open files (#10119)
parent 2e80701f58
commit c1ec3d8921
6 changed files with 22 additions and 20 deletions
@@ -53,3 +53,11 @@ This error is caused by out of GPU memory. Some possible solutions to decrease G
### failed to enable AMX

Run `export BIGDL_LLM_AMX_DISABLED=1` to disable AMX manually and resolve this error.

### oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized

You may encounter this error when finetuning on multiple GPUs. Please try `sudo apt install level-zero-dev` to fix it.

### Too many open files

You may encounter this error during finetuning, especially when running a 70B model. Please raise the system open file limit using `ulimit -n 1048576`.
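The `Too many open files` entry above raises the limit from the shell with `ulimit -n 1048576`. As a rough sketch of the same idea from inside a Python process, the standard-library `resource` module can inspect the limit and raise the soft limit up to the hard limit. The helper name `raise_open_file_limit` is hypothetical and not part of BigDL-LLM, and the sketch assumes a Unix-like system. It leaves the hard limit untouched.

```python
import resource


def raise_open_file_limit(target=1048576):
    """Raise the soft open-file limit toward `target` (hypothetical helper).

    Only the soft limit is changed; it is capped at the current hard limit,
    because raising the hard limit needs elevated privileges.
    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # RLIM_INFINITY means "no limit"; otherwise cap the request at the hard limit.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)

    if soft != resource.RLIM_INFINITY and wanted > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            # Some platforms reject large values; keep the existing limit.
            pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = raise_open_file_limit()
    print(f"open file limit: soft={soft}, hard={hard}")
```

A limit set this way applies only to the calling process and its children, so the call would have to run at the start of the finetuning script. If the hard limit is already below 1048576, raising it further requires administrator privileges.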
@@ -83,8 +83,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.

### 7. Troubleshooting
- If finetuning on multiple cards fails with the following error message:
```bash
RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
```
Please try `sudo apt install level-zero-dev` to fix it.
Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
@@ -77,8 +77,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.

### 7. Troubleshooting
- If finetuning on multiple cards fails with the following error message:
```bash
RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
```
Please try `sudo apt install level-zero-dev` to fix it.
Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
@@ -160,8 +160,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.

### 7. Troubleshooting
- If finetuning on multiple cards fails with the following error message:
```bash
RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
```
Please try `sudo apt install level-zero-dev` to fix it.
Please refer to [here](../../README.md#troubleshooting) for solutions to common issues during finetuning.
@@ -8,3 +8,13 @@ This folder contains examples of running different training mode with BigDL-LLM
- [ReLora](ReLora): examples of running ReLora finetuning
- [DPO](DPO): examples of running DPO finetuning
- [common](common): common templates and utility classes in finetuning examples


## Troubleshooting
- If finetuning on multiple cards fails with the following error message:
```bash
RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
```
Please try `sudo apt install level-zero-dev` to fix it.

- Please raise the system open file limit using `ulimit -n 1048576`. Otherwise, you may hit the error `Too many open files`.
@@ -83,8 +83,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.

### 7. Troubleshooting
- If finetuning on multiple cards fails with the following error message:
```bash
RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
```
Please try `sudo apt install level-zero-dev` to fix it.
Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.