add warmup advice in quickstart (#10293)

Shengsheng Huang 2024-03-01 17:15:45 +08:00 committed by GitHub
parent 0ab40917fb
commit 1db20dd1d0


@@ -1,10 +1,10 @@
# Install BigDL-LLM on Windows for Intel GPU
# Install BigDL-LLM on Windows with Intel GPU
This guide demonstrates how to install BigDL-LLM on Windows with Intel GPUs.
It applies to Intel Core Ultra and 12th-14th gen Intel Core integrated GPUs (iGPUs), as well as Intel Arc series GPUs.
## Install GPU driver
## Install GPU Driver
* Download and Install Visual Studio 2022 Community Edition from the [official Microsoft Visual Studio website](https://visualstudio.microsoft.com/downloads/). Ensure you select the **Desktop development with C++ workload** during the installation process.
@@ -63,7 +63,7 @@ It applies to Intel Core Ultra and 12th-14th gen Intel Core integrated GPUs (iGPUs), as
from bigdl.llm.transformers import AutoModel, AutoModelForCausalLM
```
## A quick example
## A Quick Example
Now let's play with a real LLM. We'll be using the [phi-1.5](https://huggingface.co/microsoft/phi-1_5) model, a 1.3 billion parameter LLM, for this demonstration. Follow the steps below to set up and run the model, and observe how it responds to the prompt "What is AI?".
@@ -101,6 +101,8 @@ Now let's play with a real LLM. We'll be using the [phi-1.5](https://huggingface
# Generate predicted tokens
with torch.inference_mode():
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# on the first run, warm up the model once more before the actual generation task; see details in `Tips & Troubleshooting`
# output = model.generate(input_ids, do_sample=False, max_new_tokens=32, generation_config=generation_config)
output = model.generate(input_ids, do_sample=False, max_new_tokens=32, generation_config=generation_config).cpu()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_str)
@@ -121,3 +123,8 @@ Now let's play with a real LLM. We'll be using the [phi-1.5](https://huggingface
Answer: AI stands for Artificial Intelligence, which is the simulation of human intelligence in machines.
```
## Tips & Troubleshooting
### Warmup for Optimal Performance on First Run
When running an LLM on the GPU for the first time, you might notice that performance is lower than expected, with delays of up to several minutes before the first token is generated. This happens because the GPU kernels need to be compiled and initialized, which varies across GPU models. To achieve optimal and consistent performance, we recommend a one-time warm-up: run `model.generate(...)` once more before starting your actual generation tasks. If you're developing an application, you can incorporate this warmup step into its start-up or loading routine to improve the user experience.
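For instance, here is a minimal sketch of such a start-up warmup, assuming the model and tokenizer have already been loaded to the GPU as in the quick example above (the `warmup_model` helper name is illustrative, not a BigDL-LLM API):

```python
import torch

def warmup_model(model, tokenizer, device="xpu"):
    # Illustrative helper (not a BigDL-LLM API): run one throwaway
    # generation so the GPU kernels are compiled and initialized
    # before the first real request.
    with torch.inference_mode():
        input_ids = tokenizer.encode("warmup", return_tensors="pt").to(device)
        # The output is discarded; we only want the compilation side effect.
        model.generate(input_ids, do_sample=False, max_new_tokens=1)

# Call once in your application's start-up / loading routine,
# after the model and tokenizer have been created:
# warmup_model(model, tokenizer)
```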