From 5ebfaa9a77b55be5a41b981f962888b5db732c8e Mon Sep 17 00:00:00 2001
From: Cengguang Zhang
Date: Fri, 21 Apr 2023 14:48:33 +0800
Subject: [PATCH] Doc: Add known issues for Orca. (#8096)

* Doc: add known issues for orca.

* fix: fix style.

* fix: style.
---
 .../source/doc/Orca/Overview/known_issues.md | 36 +++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/docs/readthedocs/source/doc/Orca/Overview/known_issues.md b/docs/readthedocs/source/doc/Orca/Overview/known_issues.md
index 8affb0d4..33f6add2 100644
--- a/docs/readthedocs/source/doc/Orca/Overview/known_issues.md
+++ b/docs/readthedocs/source/doc/Orca/Overview/known_issues.md
@@ -188,3 +188,39 @@ To solve this issue, please follow the steps below:
 ```bash
 sudo systemctl restart service_name
 ```
+
+### Current incarnation doesn't match with one in the group
+
+Full error log example:
+```shell
+tensorflow.python.framework.errors_impl.FailedPreconditionError: Collective ops is aborted by: Device /job:worker/replica:0/task:14/device:CPU:0 current incarnation doesn't match with one in the group. This usually means this worker has restarted but the collective leader hasn't, or this worker connects to a wrong cluster. Additional GRPC error information from remote target /job:worker/replica:0/task:0: :{"created":"@1681905587.420462284","description":"Error received from peer ipv4:172.16.0.150:47999","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Device /job:worker/replica:0/task:14/device:CPU:0 current incarnation doesn't match with one in the group. This usually means this worker has restarted but the collective leader hasn't, or this worker connects to a wrong cluster.","grpc_status":9} The error could be from a previous operation. Restart your program to reset. [Op:CollectiveReduceV2]
+```
+This error may happen when Spark reduce locality shuffle is enabled. To eliminate this issue, disable it by setting `spark.shuffle.reduceLocality.enabled` to `false`, or load the property file `spark-bigdl.conf` shipped with the BigDL release package.
+
+```shell
+# set spark.shuffle.reduceLocality.enabled to false
+spark-submit \
+    --conf spark.shuffle.reduceLocality.enabled=false \
+    ...
+
+# or load the property file
+spark-submit \
+    --properties-file /path/to/spark-bigdl.conf \
+    ...
+```
+
+### Spark tasks start before all executors are scheduled
+
+If Spark tasks start before all executors have registered, the job runs on fewer executors than requested and data processing may be slower. To avoid this, set `spark.scheduler.maxRegisteredResourcesWaitingTime` (default `30s`) to a larger value, or load the property file `spark-bigdl.conf` shipped with the BigDL release package.
+
+```shell
+# set spark.scheduler.maxRegisteredResourcesWaitingTime
+spark-submit \
+    --conf spark.scheduler.maxRegisteredResourcesWaitingTime=3600s \
+    ...
+
+# or load the property file
+spark-submit \
+    --properties-file /path/to/spark-bigdl.conf \
+    ...
+```
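+
+If you create the SparkContext from Python with Orca instead of submitting through `spark-submit`, you can pass the same properties programmatically. The snippet below is a minimal sketch with hypothetical cluster settings; it assumes your BigDL version's `init_orca_context` accepts extra Spark properties through its `conf` argument.
+
+```python
+# Minimal sketch (hypothetical cluster settings): pass the two Spark properties
+# discussed above through init_orca_context instead of spark-submit flags.
+# Assumption: this BigDL version's init_orca_context accepts a `conf` dict.
+from bigdl.orca import init_orca_context, stop_orca_context
+
+sc = init_orca_context(
+    cluster_mode="yarn-client",  # assumption: replace with your own cluster mode
+    cores=4,
+    num_nodes=2,
+    conf={
+        "spark.shuffle.reduceLocality.enabled": "false",
+        "spark.scheduler.maxRegisteredResourcesWaitingTime": "3600s",
+    },
+)
+
+# ... run your Orca workload here ...
+
+stop_orca_context()
+```
\ No newline at end of file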