From 499100daf1ced16202d9f238b76d4f00ed2c7cb9 Mon Sep 17 00:00:00 2001
From: binbin Deng <108676127+plusbang@users.noreply.github.com>
Date: Fri, 8 Dec 2023 10:51:55 +0800
Subject: [PATCH] LLM: Add solution to fix `oneccl` related error (#9630)

---
 .../example/GPU/QLoRA-FineTuning/alpaca-qlora/README.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/python/llm/example/GPU/QLoRA-FineTuning/alpaca-qlora/README.md b/python/llm/example/GPU/QLoRA-FineTuning/alpaca-qlora/README.md
index bc8ca67e..98cfbae3 100644
--- a/python/llm/example/GPU/QLoRA-FineTuning/alpaca-qlora/README.md
+++ b/python/llm/example/GPU/QLoRA-FineTuning/alpaca-qlora/README.md
@@ -125,3 +125,10 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
 ```
 
 Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
+
+### 5. Troubleshooting
+- If you fail to finetune on multiple cards because of the following error message:
+  ```bash
+  RuntimeError: oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
+  ```
+  please try `sudo apt install level-zero-dev` to fix it.