From dd303776cf9fbd863b3a2a5cc0764da29239feb5 Mon Sep 17 00:00:00 2001
From: binbin Deng <108676127+plusbang@users.noreply.github.com>
Date: Mon, 26 Aug 2024 16:06:32 +0800
Subject: [PATCH] Add troubleshooting about transpose value setting

---
 .../NPU/HF-Transformers-AutoModels/LLM/README.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
index 111f1480..12bce0de 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
@@ -130,6 +130,18 @@ Arguments info:
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
 - `--max-output-len MAX_OUTPUT_LEN`: Defines the maximum sequence length for both input and output tokens. It is default to be `1024`.
 - `--max-prompt-len MAX_PROMPT_LEN`: Defines the maximum number of tokens that the input prompt can contain. It is default to be `512`.
+- `--disable-transpose-value-cache`: Disable the optimization of transposing the value cache.
+
+### 4. Troubleshooting
+
+If you encounter output problems, please try disabling the optimization of transposing the value cache with the following command:
+```bash
+# to run Llama-2-7b-chat-hf
+python llama2.py --disable-transpose-value-cache
+
+# to run Qwen2-1.5B-Instruct
+python qwen2.py --disable-transpose-value-cache
+```
 
 #### Sample Output
 
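For context, a minimal sketch of how an example script such as `llama2.py` might consume the new `--disable-transpose-value-cache` flag. Only the flag name comes from this patch; the `ipex_llm.transformers.npu_model` import path and the `transpose_value_cache`, `max_output_len`, `max_prompt_len`, and `load_in_low_bit` keyword arguments are assumptions about the example scripts, not something this diff specifies.

```python
# Hypothetical sketch of wiring --disable-transpose-value-cache into model loading.
# The from_pretrained keywords below are assumptions, not confirmed by this patch.
import argparse

import torch
from ipex_llm.transformers.npu_model import AutoModelForCausalLM  # assumed import path
from transformers import AutoTokenizer

parser = argparse.ArgumentParser(description="Run an LLM example on Intel NPU")
parser.add_argument("--repo-id-or-model-path", type=str, default="meta-llama/Llama-2-7b-chat-hf")
parser.add_argument("--max-output-len", type=int, default=1024)
parser.add_argument("--max-prompt-len", type=int, default=512)
parser.add_argument("--disable-transpose-value-cache", action="store_true", default=False)
args = parser.parse_args()

# Passing the flag turns the transposed-value-cache optimization off;
# by default the optimization stays enabled.
model = AutoModelForCausalLM.from_pretrained(
    args.repo_id_or_model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    load_in_low_bit="sym_int4",                                    # assumed quantization setting
    max_output_len=args.max_output_len,                            # assumed keyword
    max_prompt_len=args.max_prompt_len,                            # assumed keyword
    transpose_value_cache=not args.disable_transpose_value_cache,  # assumed keyword
)
tokenizer = AutoTokenizer.from_pretrained(args.repo_id_or_model_path, trust_remote_code=True)
```

Under this reading, the flag defaults to off, so the optimization remains enabled unless the user explicitly disables it, which matches the troubleshooting wording added by the patch.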