ipex-llm/python
Qiyuan Gong 9e18ea187f [LLM] Avoid KV Cache OOM when seq len is larger than 1 (#10006)
* Avoid OOM during multi-round streaming chat with kv cache
* For llama-like kv caches, i.e. shape [bs, n_head, seq_len, head_dim], use is_enough_kv_cache_room_4_31.
* Other models instead compare the kv cache size against kv_len directly.
2024-01-26 17:30:08 +08:00
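The room check described by the commit message can be sketched as follows. This is a minimal, hypothetical illustration, not the actual ipex-llm helper (`is_enough_kv_cache_room_4_31`, whose real signature and internals may differ): it assumes a llama-like cache layout of [bs, n_head, seq_len, head_dim] and asks whether the pre-allocated seq_len dimension still has room for the incoming tokens before appending.

```python
def is_enough_kv_cache_room(kv_cache, kv_len, new_tokens=1):
    """Hypothetical sketch of a KV-cache room check.

    kv_cache  -- any object exposing a .shape tuple (e.g. a torch.Tensor)
                 laid out as [bs, n_head, seq_len, head_dim] (llama-like)
    kv_len    -- number of positions already occupied in the cache
    new_tokens-- number of tokens about to be appended
    """
    allocated_seq_len = kv_cache.shape[2]
    # If the occupied length plus the incoming tokens would exceed the
    # allocated seq_len, the cache must be re-allocated (otherwise OOM or
    # out-of-bounds writes can occur during multi-round streaming chat).
    return kv_len + new_tokens <= allocated_seq_len
```

For models whose caches do not follow this layout, the same idea applies, but the comparison is made against the cache's total size rather than a fixed seq_len axis, as the second bullet notes.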