From dd303776cf9fbd863b3a2a5cc0764da29239feb5 Mon Sep 17 00:00:00 2001
From: binbin Deng <108676127+plusbang@users.noreply.github.com>
Date: Mon, 26 Aug 2024 16:06:32 +0800
Subject: [PATCH] Add troubleshooting about transpose value setting

---
 .../NPU/HF-Transformers-AutoModels/LLM/README.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
index 111f1480..12bce0de 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
@@ -130,6 +130,18 @@ Arguments info:
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
 - `--max-output-len MAX_OUTPUT_LEN`: Defines the maximum sequence length for both input and output tokens. It is default to be `1024`.
 - `--max-prompt-len MAX_PROMPT_LEN`: Defines the maximum number of tokens that the input prompt can contain. It is default to be `512`.
+- `--disable-transpose-value-cache`: Disable the optimization of transposing the value cache.
+
+### 4. Troubleshooting
+
+If you encounter output problems, please try disabling the optimization of transposing the value cache with the following command:
+```bash
+# to run Llama-2-7b-chat-hf
+python llama2.py --disable-transpose-value-cache
+
+# to run Qwen2-1.5B-Instruct
+python qwen2.py --disable-transpose-value-cache
+```
 
 #### Sample Output
 
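For context, a minimal sketch of how an example script such as `llama2.py` might consume the new `--disable-transpose-value-cache` flag. Only the flag name comes from this patch; the `ipex_llm.transformers.npu_model` import path and the `transpose_value_cache`, `max_output_len`, `max_prompt_len`, and `load_in_low_bit` keyword arguments are assumptions about the example scripts, not something this diff specifies.

```python
# Hypothetical sketch of wiring --disable-transpose-value-cache into model loading.
# The from_pretrained keywords below are assumptions, not confirmed by this patch.
import argparse

import torch
from ipex_llm.transformers.npu_model import AutoModelForCausalLM  # assumed import path
from transformers import AutoTokenizer

parser = argparse.ArgumentParser(description="Run an LLM example on Intel NPU")
parser.add_argument("--repo-id-or-model-path", type=str, default="meta-llama/Llama-2-7b-chat-hf")
parser.add_argument("--max-output-len", type=int, default=1024)
parser.add_argument("--max-prompt-len", type=int, default=512)
parser.add_argument("--disable-transpose-value-cache", action="store_true", default=False)
args = parser.parse_args()

# Passing the flag turns the transposed-value-cache optimization off;
# by default the optimization stays enabled.
model = AutoModelForCausalLM.from_pretrained(
    args.repo_id_or_model_path,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    load_in_low_bit="sym_int4",                                    # assumed quantization setting
    max_output_len=args.max_output_len,                            # assumed keyword
    max_prompt_len=args.max_prompt_len,                            # assumed keyword
    transpose_value_cache=not args.disable_transpose_value_cache,  # assumed keyword
)
tokenizer = AutoTokenizer.from_pretrained(args.repo_id_or_model_path, trust_remote_code=True)
```

Under this reading, the flag defaults to off, so the optimization remains enabled unless the user explicitly disables it, which matches the troubleshooting wording added by the patch.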