diff --git a/python/llm/example/GPU/Long-Context/README.md b/python/llm/example/GPU/Long-Context/README.md
new file mode 100644
index 00000000..b97213fe
--- /dev/null
+++ b/python/llm/example/GPU/Long-Context/README.md
@@ -0,0 +1,33 @@
+# Running Long-Context Generation using IPEX-LLM on Intel Arc™ A770 Graphics
+
+Long-context generation is critical in applications such as document summarization, extended conversation handling, and complex question answering. Effective long-context generation leads to more coherent and contextually relevant responses, enhancing both user experience and model utility.
+
+This folder contains examples of running long-context generation with IPEX-LLM on Intel Arc™ A770 Graphics (16GB GPU memory):
+
+- [LLaMA2-32K](LLaMA2-32K): examples of running LLaMA2-32K models with INT4/FP8 precision.
+- [ChatGLM3-32K](Chatglm3-32K): examples of running ChatGLM3-32K models with INT4/FP8 precision.
+
+### Maximum Input Length for Different Models with INT4/FP8 Precision
+
+- **INT4**
+
+  | Model Name | Low Memory Mode | Maximum Input Length | Output Length |
+  | -- | -- | -- | -- |
+  | LLaMA2-7B-32K | Disabled | 10K | 512 |
+  | | Enabled | 12K | 512 |
+  | ChatGLM3-6B-32K | Disabled | 9K | 512 |
+  | | Enabled | 10K | 512 |
+
+- **FP8**
+
+  | Model Name | Low Memory Mode | Maximum Input Length | Output Length |
+  | -- | -- | -- | -- |
+  | LLaMA2-7B-32K | Disabled | 7K | 512 |
+  | | Enabled | 9K | 512 |
+  | ChatGLM3-6B-32K | Disabled | 8K | 512 |
+  | | Enabled | 9K | 512 |
+
+> Note: If you need to run longer inputs or use less GPU memory, set `IPEX_LLM_LOW_MEM=1` to enable **low memory mode**, which applies additional memory optimizations and may slightly affect performance.
\ No newline at end of file
diff --git a/python/llm/example/GPU/README.md b/python/llm/example/GPU/README.md
index dcda29af..924611e7 100644
--- a/python/llm/example/GPU/README.md
+++ b/python/llm/example/GPU/README.md
@@ -12,6 +12,7 @@ This folder contains examples of running IPEX-LLM on Intel GPU:
 - [PyTorch-Models](PyTorch-Models): running any PyTorch model on IPEX-LLM (with "one-line code change")
 - [Speculative-Decoding](Speculative-Decoding): running any ***Hugging Face Transformers*** model with ***self-speculative decoding*** on Intel GPUs
 - [ModelScope-Models](ModelScope-Models): running ***ModelScope*** model with IPEX-LLM on Intel GPUs
+- [Long-Context](Long-Context): running **long-context** generation with IPEX-LLM on Intel Arc™ A770 Graphics
 
 ## System Support
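
For reference, below is a minimal sketch of how one of these long-context examples might be run with low memory mode enabled, following the usual style of the IPEX-LLM GPU examples. It is not taken from the diff: the model path, prompt, and generation settings are illustrative assumptions, and how early `IPEX_LLM_LOW_MEM` must be set is hedged by setting it before the model is loaded.

```python
import os

# Assumption: enable low memory mode before loading the model,
# per the note in the new README.
os.environ["IPEX_LLM_LOW_MEM"] = "1"

import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

# Assumed LLaMA2-32K checkpoint; substitute a local path if needed.
model_path = "togethercomputer/LLaMA-2-7B-32K"

# load_in_4bit=True loads the model with INT4 precision;
# for FP8 one would pass load_in_low_bit="fp8" instead.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
                                             trust_remote_code=True,
                                             use_cache=True)
model = model.to("xpu")  # move the model to the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# A long document to summarize; its token count should stay within
# the maximum input lengths listed in the tables above.
prompt = "Summarize the following document:\n..."

with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to("xpu")
    # 512 new tokens matches the output length used in the tables.
    output = model.generate(input_ids, max_new_tokens=512)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```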