LLM: update FAQ about too many open files (#10119)
parent 2e80701f58
commit c1ec3d8921
6 changed files with 22 additions and 20 deletions
@@ -53,3 +53,11 @@ This error is caused by out of GPU memory. Some possible solutions to decrease G
 ### failed to enable AMX
 
 You could use `export BIGDL_LLM_AMX_DISABLED=1` to disable AMX manually and solve this error.
+
+### oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
+
+You may encounter this error during finetuning on multiple GPUs. Please try `sudo apt install level-zero-dev` to fix it.
+
+### Too many open files
+
+You may encounter this error during finetuning, especially when running a 70B model. Please raise the system open file limit using `ulimit -n 1048576`.
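The `Too many open files` entry above raises the limit from the shell with `ulimit -n 1048576`. As a rough sketch only, and not something this commit adds, the same limit can be inspected and raised from inside a Python process with the standard-library `resource` module (Unix only). The helper name `raise_open_file_limit` is hypothetical; the default target of 1048576 is taken from the FAQ text.

```python
import resource


def raise_open_file_limit(target=1048576):
    """Try to raise the soft open-file limit toward `target`.

    Hypothetical helper for illustration. Returns the (soft, hard)
    limits in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process can raise its soft limit only up to the hard limit.
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        new_soft = min(target, hard)

    # Only act when the soft limit is finite and actually lower than the goal.
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            # The OS refused the new value; keep the existing limits.
            pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = raise_open_file_limit()
    print(f"open file limit: soft={soft}, hard={hard}")
```

Calling `raise_open_file_limit()` at the top of a finetuning script affects only that process and its children. If the returned soft value is still too low, the hard limit itself has to be raised at the system level, which is what running `ulimit -n 1048576` with sufficient privileges does for the shell.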
@@ -83,8 +83,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 
 ### 7. Troubleshooting
-- If you fail to finetune on multi cards because of following error message:
-  ```bash
-  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
-  ```
-  Please try `sudo apt install level-zero-dev` to fix it.
+Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
@@ -77,8 +77,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 
 ### 7. Troubleshooting
-- If you fail to finetune on multi cards because of following error message:
-  ```bash
-  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
-  ```
-  Please try `sudo apt install level-zero-dev` to fix it.
+Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.
@@ -160,8 +160,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 
 ### 7. Troubleshooting
-- If you fail to finetune on multi cards because of following error message:
-  ```bash
-  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
-  ```
-  Please try `sudo apt install level-zero-dev` to fix it.
+Please refer to [here](../../README.md#troubleshooting) for solutions to common issues during finetuning.
@@ -8,3 +8,13 @@ This folder contains examples of running different training mode with BigDL-LLM
 - [ReLora](ReLora): examples of running ReLora finetuning
 - [DPO](DPO): examples of running DPO finetuning
 - [common](common): common templates and utility classes in finetuning examples
+
+
+## Troubleshooting
+- If you fail to finetune on multiple cards because of the following error message:
+  ```bash
+  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
+  ```
+  Please try `sudo apt install level-zero-dev` to fix it.
+
+- Please raise the system open file limit using `ulimit -n 1048576`. Otherwise, you may hit the error `Too many open files`.
@@ -83,8 +83,4 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
 
 ### 7. Troubleshooting
-- If you fail to finetune on multi cards because of following error message:
-  ```bash
-  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
-  ```
-  Please try `sudo apt install level-zero-dev` to fix it.
+Please refer to [here](../README.md#troubleshooting) for solutions to common issues during finetuning.