LLM: update quickstart Windows gpu install guide & other quickstart doc style (#10365)
* init
* fix doc style, add modelscope and tutorial
* fix web ui doc style
* add exit way
* fix
* fix modelscope note
* fix according to comment
* fix according to comment
* fix
* fix according to comments
* fix
* fix
* fix
* fix style
* try fix
* fix
* fix
* Small updates

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
This commit is contained in:
parent
28c4a8cf5c
commit
c2fb17bd43
2 changed files with 300 additions and 89 deletions
|
|
@ -6,23 +6,32 @@ It applies to Intel Core Ultra and Core 12 - 14 gen integrated GPUs (iGPUs), as
|
|||
|
||||
## Install Visual Studio 2022
|
||||
|
||||
* Download and Install Visual Studio 2022 Community Edition from the [official Microsoft Visual Studio website](https://visualstudio.microsoft.com/downloads/). Ensure you select the **Desktop development with C++ workload** during the installation process.
|
||||
Download and install Visual Studio 2022 Community Edition from the [official Microsoft Visual Studio website](https://visualstudio.microsoft.com/downloads/). Ensure you select the **Desktop development with C++** workload during the installation process.
|
||||
|
||||
> Note: The installation could take around 15 minutes, and requires at least 7GB of free disk space.
|
||||
> If you accidentally skip adding the **Desktop development with C++ workload** during the initial setup, you can add it afterward by navigating to **Tools > Get Tools and Features...**. Follow the instructions on [this Microsoft guide](https://learn.microsoft.com/en-us/cpp/build/vscpp-step-0-installation?view=msvc-170#step-4---choose-workloads) to update your installation.
|
||||
>
|
||||
> <img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_1.png" alt="image-20240221102252560" width=100%; />
|
||||
```eval_rst
|
||||
.. note::
|
||||
|
||||
The installation could take around 15 minutes, and requires at least 7GB of free disk space.
|
||||
If you accidentally skip adding the **Desktop development with C++ workload** during the initial setup, you can add it afterward by navigating to **Tools > Get Tools and Features...**. Follow the instructions on `this Microsoft guide <https://learn.microsoft.com/en-us/cpp/build/vscpp-step-0-installation?view=msvc-170#step-4---choose-workloads>`_ to update your installation.
|
||||
```
|
||||
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_1.png" alt="image-20240221102252560" width=100%; />
|
||||
|
||||
## Install GPU Driver
|
||||
|
||||
* Download and install the latest GPU driver from the [official Intel download page](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html). A system reboot is necessary to apply the changes after the installation is complete.
|
||||
Download and install the latest GPU driver from the [official Intel download page](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html). A system reboot is necessary to apply the changes after the installation is complete.
|
||||
|
||||
> Note: The process could take around 10 minutes. After reboot, check for the **Intel Arc Control** application to verify the driver has been installed correctly. If the installation was successful, you should see the **Arc Control** interface similar to the figure below
|
||||
```eval_rst
|
||||
.. note::
|
||||
|
||||
> <img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_3.png" width=80%; />
|
||||
The process could take around 10 minutes. After reboot, check for the **Intel Arc Control** application to verify the driver has been installed correctly. If the installation was successful, you should see the **Arc Control** interface similar to the figure below.
|
||||
```
|
||||
|
||||
* To monitor your GPU's performance and status, you can use either the **Windows Task Manager** (see the left side of the figure below) or the **Arc Control** application (see the right side of the figure below) :
|
||||
> <img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_4.png" width=70%; />
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_3.png" width=100%; />
|
||||
|
||||
To monitor your GPU's performance and status, you can use either the **Windows Task Manager** (see the left side of the figure below) or the **Arc Control** application (see the right side of the figure below):
|
||||
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_4.png" width=100%; />
|
||||
|
||||
## Install oneAPI
|
||||
|
||||
|
|
@ -31,94 +40,284 @@ It applies to Intel Core Ultra and Core 12 - 14 gen integrated GPUs (iGPUs), as
|
|||
pip install dpcpp-cpp-rt==2024.0.2 mkl-dpcpp==2024.0.0 onednn==2024.0.0
|
||||
``` -->
|
||||
|
||||
* Download and install the [**Intel oneAPI Base Toolkit**](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=window&distributions=offline). During installation, you can continue with the default installation settings.
|
||||
Download and install the [**Intel oneAPI Base Toolkit**](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=window&distributions=offline). During installation, you can continue with the default installation settings.
|
||||
|
||||
> <img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_oneapi_offline_installer.png" width=90%; />
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_oneapi_offline_installer.png" width=100%; />
|
||||
|
||||
```eval_rst
|
||||
.. tip::
|
||||
|
||||
If the oneAPI installation hangs at the finalization step for more than 10 minutes, the cause might be a problematic Visual Studio installation. Reboot your computer and then launch the Visual Studio installer. If you see installation error messages, repair your Visual Studio installation. After the repair is done, the oneAPI installation should complete successfully.
|
||||
```
|
||||
|
||||
## Setup Python Environment
|
||||
|
||||
* Visit [Miniconda installation page](https://docs.anaconda.com/free/miniconda/), download the **Miniconda installer for Windows**, and follow the instructions to complete the installation.
|
||||
Visit the [Miniconda installation page](https://docs.anaconda.com/free/miniconda/), download the **Miniconda installer for Windows**, and follow the instructions to complete the installation.
|
||||
|
||||
> <img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_5.png" width=50%; />
|
||||
<div align="center">
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_5.png" width=70%/>
|
||||
</div>
|
||||
|
||||
* After installation, open the **Anaconda Prompt**, create a new python environment `llm`:
|
||||
```cmd
|
||||
conda create -n llm python=3.9 libuv
|
||||
```
|
||||
* Activate the newly created environment `llm`:
|
||||
```cmd
|
||||
conda activate llm
|
||||
```
|
||||
After installation, open the **Anaconda Prompt** and create a new Python environment `llm`:
|
||||
```cmd
|
||||
conda create -n llm python=3.9 libuv
|
||||
```
|
||||
Activate the newly created environment `llm`:
|
||||
```cmd
|
||||
conda activate llm
|
||||
```
|
||||
|
||||
## Install `bigdl-llm`
|
||||
|
||||
* With the `llm` environment active, use `pip` to install `bigdl-llm` for GPU:
|
||||
Choose either US or CN website for `extra-index-url`:
|
||||
* US:
|
||||
```cmd
|
||||
With the `llm` environment active, use `pip` to install `bigdl-llm` for GPU:
|
||||
Choose either US or CN website for `extra-index-url`:
|
||||
|
||||
```eval_rst
|
||||
.. tabs::
|
||||
.. tab:: US
|
||||
|
||||
.. code-block:: cmd
|
||||
|
||||
pip install --pre --upgrade bigdl-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
|
||||
```
|
||||
* CN:
|
||||
```cmd
|
||||
|
||||
.. tab:: CN
|
||||
|
||||
.. code-block:: cmd
|
||||
|
||||
pip install --pre --upgrade bigdl-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
|
||||
```
|
||||
> Note: If you encounter network issues while installing IPEX, refer to [this guide](https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#install-bigdl-llm-from-wheel) for troubleshooting advice.
|
||||
```
|
||||
|
||||
* You can verfy if bigdl-llm is successfully by simply importing a few classes from the library. For example, in the Python interactive shell, execute the following import command:
|
||||
```python
|
||||
from bigdl.llm.transformers import AutoModel,AutoModelForCausalLM
|
||||
```
|
||||
```eval_rst
|
||||
.. note::
|
||||
|
||||
## A Quick Example
|
||||
If you encounter network issues while installing IPEX, refer to `this guide <https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#install-bigdl-llm-from-wheel>`_ for troubleshooting advice.
|
||||
```
|
||||
|
||||
Now let's play with a real LLM. We'll be using the [phi-1.5](https://huggingface.co/microsoft/phi-1_5) model, a 1.3 billion parameter LLM for this demostration. Follow the steps below to setup and run the model, and observe how it responds to a prompt "What is AI?".
|
||||
## Verify Installation
|
||||
You can verify that `bigdl-llm` is installed successfully by running a few lines of code:
|
||||
|
||||
* Step 1: Open the **Anaconda Prompt** and activate the Python environment `llm` you previously created:
|
||||
```cmd
|
||||
conda activate llm
|
||||
```
|
||||
* Step 2: Configure oneAPI variables by running the following command:
|
||||
> For more details about runtime configurations, refer to [this guide](https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration):
|
||||
```cmd
|
||||
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
|
||||
```
|
||||
If you're running on iGPU, set additional environment variables by running the following commands:
|
||||
```cmd
|
||||
* Step 3:
|
||||
Please also set the following environment variables according to your device:
|
||||
|
||||
```eval_rst
|
||||
.. tabs::
|
||||
.. tab:: Intel iGPU
|
||||
|
||||
.. code-block:: cmd
|
||||
|
||||
set SYCL_CACHE_PERSISTENT=1
|
||||
set BIGDL_LLM_XMX_DISABLED=1
|
||||
|
||||
.. tab:: Intel Arc™ A770
|
||||
|
||||
There is no need to set further environment variables.
|
||||
```
|
||||
* Step 3: To ensure compatibility with `phi-1.5`, update the transformers library to version 4.37.0:
|
||||
```cmd
|
||||
pip install -U transformers==4.37.0
|
||||
|
||||
```eval_rst
|
||||
.. seealso::
|
||||
|
||||
For other Intel dGPU Series, please refer to `this guide <https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration>`_ for more details regarding runtime configuration.
|
||||
```
|
||||
* Step 4: Create a new file named `demo.py` and insert the code snippet below.
|
||||
* Step 4: Launch the Python interactive shell by typing `python` in the Anaconda prompt window and then press Enter.
|
||||
|
||||
* Step 5: Copy the following code into the Anaconda Prompt **line by line**, pressing Enter **after copying each line**.
|
||||
```python
|
||||
import torch
|
||||
from bigdl.llm.transformers import AutoModel,AutoModelForCausalLM
|
||||
tensor_1 = torch.randn(1, 1, 40, 128).to('xpu')
|
||||
tensor_2 = torch.randn(1, 1, 128, 40).to('xpu')
|
||||
print(torch.matmul(tensor_1, tensor_2).size())
|
||||
```
|
||||
It will output the following content at the end:
|
||||
```
|
||||
torch.Size([1, 1, 40, 40])
|
||||
```
|
||||
|
||||
```eval_rst
|
||||
.. seealso::
|
||||
|
||||
If you encounter any problems, please refer to `this troubleshooting guide <https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#troubleshooting>`_ for help.
|
||||
```
|
||||
* To exit the Python interactive shell, simply press Ctrl+Z then press Enter (or input `exit()` then press Enter).
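
The import check in this section can also be scripted as a quick pre-flight test before touching the GPU. The following is a minimal sketch, not part of `bigdl-llm`: the helper name `is_importable` is hypothetical, and it only assumes the module names `torch` and `bigdl.llm.transformers` used in this guide. It reports whether each module can be located; it does not run anything on the `xpu` device.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located.

    Hypothetical helper for illustration. For dotted names, Python imports
    the parent packages in order to locate the submodule.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # ModuleNotFoundError (an ImportError) is raised when a parent
        # package of a dotted name is missing.
        return False


if __name__ == "__main__":
    # Module names taken from the verification snippet in this guide.
    for name in ("torch", "bigdl.llm.transformers"):
        status = "found" if is_importable(name) else "MISSING"
        print(f"{name}: {status}")
```

If either module is reported as missing, re-check that the `llm` environment is active before re-running the `pip install` step.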
|
||||
|
||||
|
||||
## A Quick Example
|
||||
|
||||
Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) model, a 1.8 billion parameter LLM, for this demonstration. Follow the steps below to set up and run the model, and observe how it responds to the prompt "What is AI?".
|
||||
|
||||
* Step 1: Open the **Anaconda Prompt** and activate the Python environment `llm` you previously created:
|
||||
```cmd
|
||||
conda activate llm
|
||||
```
|
||||
* Step 2: Configure oneAPI variables by running the following command:
|
||||
```cmd
|
||||
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
|
||||
```
|
||||
* Step 3:
|
||||
Please also set the following environment variables according to your device:
|
||||
|
||||
```eval_rst
|
||||
.. tabs::
|
||||
.. tab:: Intel iGPU
|
||||
|
||||
.. code-block:: cmd
|
||||
|
||||
set SYCL_CACHE_PERSISTENT=1
|
||||
set BIGDL_LLM_XMX_DISABLED=1
|
||||
|
||||
.. tab:: Intel Arc™ A770
|
||||
|
||||
There is no need to set further environment variables.
|
||||
```
|
||||
|
||||
```eval_rst
|
||||
.. seealso::
|
||||
|
||||
For other Intel dGPU Series, please refer to `this guide <https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration>`_ for more details regarding runtime configuration.
|
||||
```
|
||||
* Step 4: Install the additional packages required to run Qwen-1.8B-Chat:
|
||||
```cmd
|
||||
pip install tiktoken transformers_stream_generator einops
|
||||
```
|
||||
* Step 5: Create the code file. BigDL-LLM supports loading models from Hugging Face or ModelScope. Choose the one that fits your requirements.
|
||||
```eval_rst
|
||||
.. tabs::
|
||||
.. tab:: Hugging Face
|
||||
Create a new file named ``demo.py`` and insert the code snippet below to run the `Qwen-1.8B-Chat <https://huggingface.co/Qwen/Qwen-1_8B-Chat>`_ model with BigDL-LLM optimizations.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Copy/Paste the contents to a new file demo.py
|
||||
import torch
|
||||
from bigdl.llm.transformers import AutoModelForCausalLM
|
||||
from transformers import AutoTokenizer, GenerationConfig
|
||||
generation_config = GenerationConfig(use_cache = True)
|
||||
generation_config = GenerationConfig(use_cache=True)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
|
||||
# load Model using bigdl-llm and load it to GPU
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"microsoft/phi-1_5", load_in_4bit=True, cpu_embedding=True, trust_remote_code=True)
|
||||
print('Now start loading Tokenizer and optimizing Model...')
|
||||
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
|
||||
trust_remote_code=True)
|
||||
|
||||
# Load Model using bigdl-llm and load it to GPU
|
||||
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
|
||||
load_in_4bit=True,
|
||||
cpu_embedding=True,
|
||||
trust_remote_code=True)
|
||||
model = model.to('xpu')
|
||||
print('Successfully loaded Tokenizer and optimized Model!')
|
||||
|
||||
# Format the prompt
|
||||
question = "What is AI?"
|
||||
prompt = " Question:{prompt}\n\n Answer:".format(prompt=question)
|
||||
prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
|
||||
|
||||
# Generate predicted tokens
|
||||
with torch.inference_mode():
|
||||
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
|
||||
# warm up one more time before the actual generation task for the first run, see details in `Tips & Troubleshooting`
|
||||
# output = model.generate(input_ids, do_sample=False, max_new_tokens=32, generation_config = generation_config)
|
||||
output = model.generate(input_ids, do_sample=False, max_new_tokens=32, generation_config = generation_config).cpu()
|
||||
|
||||
print('--------------------------------------Note-----------------------------------------')
|
||||
print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
|
||||
print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |')
|
||||
print('| Please be patient until it finishes warm-up... |')
|
||||
print('-----------------------------------------------------------------------------------')
|
||||
|
||||
# To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks.
|
||||
# If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience.
|
||||
output = model.generate(input_ids,
|
||||
do_sample=False,
|
||||
max_new_tokens=32,
|
||||
generation_config=generation_config) # warm-up
|
||||
|
||||
print('Successfully finished warm-up, now start generation...')
|
||||
|
||||
output = model.generate(input_ids,
|
||||
do_sample=False,
|
||||
max_new_tokens=32,
|
||||
generation_config=generation_config).cpu()
|
||||
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
|
||||
print(output_str)
|
||||
|
||||
.. tab:: ModelScope
|
||||
|
||||
Please first run the following command in Anaconda Prompt to install ModelScope:
|
||||
|
||||
.. code-block:: cmd
|
||||
|
||||
pip install modelscope==1.11.0
|
||||
|
||||
Create a new file named ``demo.py`` and insert the code snippet below to run the `Qwen-1.8B-Chat <https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary>`_ model with BigDL-LLM optimizations.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
# Copy/Paste the contents to a new file demo.py
|
||||
import torch
|
||||
from bigdl.llm.transformers import AutoModelForCausalLM
|
||||
from transformers import GenerationConfig
|
||||
from modelscope import AutoTokenizer
|
||||
generation_config = GenerationConfig(use_cache=True)
|
||||
|
||||
print('Now start loading Tokenizer and optimizing Model...')
|
||||
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
|
||||
trust_remote_code=True)
|
||||
|
||||
# Load Model using bigdl-llm and load it to GPU
|
||||
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
|
||||
load_in_4bit=True,
|
||||
cpu_embedding=True,
|
||||
trust_remote_code=True,
|
||||
model_hub='modelscope')
|
||||
model = model.to('xpu')
|
||||
print('Successfully loaded Tokenizer and optimized Model!')
|
||||
|
||||
# Format the prompt
|
||||
question = "What is AI?"
|
||||
prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
|
||||
|
||||
# Generate predicted tokens
|
||||
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

    print('--------------------------------------Note-----------------------------------------')
    print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
    print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |')
    print('| Please be patient until it finishes warm-up... |')
    print('-----------------------------------------------------------------------------------')

    # To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks.
    # If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience.
    output = model.generate(input_ids,
                            do_sample=False,
                            max_new_tokens=32,
                            generation_config=generation_config) # warm-up

    print('Successfully finished warm-up, now start generation...')

    output = model.generate(input_ids,
                            do_sample=False,
                            max_new_tokens=32,
                            generation_config=generation_config).cpu()
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    print(output_str)
|
||||
|
||||
|
||||
.. tip::
|
||||
|
||||
Please note that the repo id on ModelScope may be different from the one on Hugging Face for some models.
|
||||
|
||||
```
|
||||
|
||||
```eval_rst
|
||||
.. note::
|
||||
|
||||
When running LLMs on Intel iGPUs with limited memory size, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function.
|
||||
This will allow the memory-intensive embedding layer to use the CPU instead of the GPU.
|
||||
```
|
||||
> Note: when running LLMs on Intel iGPUs with limited memory size, we recommend setting `cpu_embedding=True` in the `from_pretrained` function.
|
||||
> This will allow the memory-intensive embedding layer to utilize the CPU instead of GPU.
|
||||
|
||||
* Step 5. Run `demo.py` within the activated Python environment using the following command:
|
||||
```cmd
|
||||
|
|
@ -127,14 +326,14 @@ Now let's play with a real LLM. We'll be using the [phi-1.5](https://huggingface
|
|||
|
||||
### Example output
|
||||
|
||||
Example output on a system equipped with an 11th Gen Intel Core i7 CPU and Iris Xe Graphics iGPU:
|
||||
Example output on a system equipped with an Intel Core Ultra 5 125H CPU and Intel Arc Graphics iGPU:
|
||||
```
|
||||
Question:What is AI?
|
||||
Answer: AI stands for Artificial Intelligence, which is the simulation of human intelligence in machines.
|
||||
user: What is AI?
|
||||
|
||||
assistant: AI stands for Artificial Intelligence, which refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition,
|
||||
```
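
The example output echoes the prompt because `model.generate(...)` returns the input tokens followed by the newly generated ones. The sketch below shows one way to build the `user:`/`assistant:` prompt used in `demo.py` and to strip the echoed prompt from the decoded text. The helper names `build_prompt` and `extract_answer` are hypothetical and not part of BigDL-LLM; the sketch also assumes the decoded text begins with the prompt verbatim, and falls back to returning the full text when it does not.

```python
# Prompt template copied from demo.py in this guide.
PROMPT_TEMPLATE = "user: {prompt}\n\nassistant:"


def build_prompt(question: str) -> str:
    """Format a question the same way demo.py does."""
    return PROMPT_TEMPLATE.format(prompt=question)


def extract_answer(prompt: str, decoded_output: str) -> str:
    """Drop the echoed prompt from the decoded output, if it is present.

    Hypothetical helper: if the decoded text does not start with the
    prompt, the whole text is returned unchanged (apart from stripping
    surrounding whitespace).
    """
    if decoded_output.startswith(prompt):
        return decoded_output[len(prompt):].strip()
    return decoded_output.strip()


if __name__ == "__main__":
    prompt = build_prompt("What is AI?")
    # Stand-in for tokenizer.decode(output[0], skip_special_tokens=True)
    decoded = prompt + " AI stands for Artificial Intelligence."
    print(extract_answer(prompt, decoded))
```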
|
||||
|
||||
## Tips & Troubleshooting
|
||||
|
||||
### Warmup for optimial performance on first run
|
||||
When running LLMs on GPU for the first time, you might notice the performance is lower than expected, with delays up to several minutes before the first token is generated. This delay occurs because the GPU kernels require compilation and initialization, which varies across different GPU models. To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. If you're developing an application, you can incorporate this warmup step into start-up or loading routine to enhance the user experience.
|
||||
|
||||
### Warm-up for optimal performance on first run
|
||||
When running LLMs on GPU for the first time, you might notice the performance is lower than expected, with delays of up to several minutes before the first token is generated. This delay occurs because the GPU kernels require compilation and initialization, and its length varies across GPU models. To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. If you're developing an application, you can incorporate this warm-up step into your start-up or loading routine to enhance the user experience.
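
The warm-up pattern can be sketched independently of any model: call the generation function once and discard the result, then time the call you actually care about. In the sketch below, `warm_up_then_run` is a hypothetical helper and `fake_generate` is a stand-in for something like `lambda: model.generate(input_ids, ...)`; neither is part of BigDL-LLM.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def warm_up_then_run(generate: Callable[[], T]) -> Tuple[T, float, float]:
    """Call `generate` twice.

    Returns (result of the second call, warm-up seconds, second-call seconds).
    """
    start = time.perf_counter()
    generate()  # warm-up call; its result is discarded
    warm_up_seconds = time.perf_counter() - start

    start = time.perf_counter()
    result = generate()  # the call whose result and timing we keep
    run_seconds = time.perf_counter() - start
    return result, warm_up_seconds, run_seconds


if __name__ == "__main__":
    def fake_generate() -> str:
        # Stand-in for a real generation call.
        time.sleep(0.01)
        return "ok"

    result, warm, timed = warm_up_then_run(fake_generate)
    print(f"warm-up: {warm:.3f}s, timed run: {timed:.3f}s, result: {result}")
```

In an application, the same idea would typically run once in the start-up or model-loading routine rather than around every request.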
|
||||
|
|
|
|||
|
|
@ -6,7 +6,7 @@ This quickstart guide walks you through setting up and using the [Text Generatio
|
|||
|
||||
A preview of the WebUI in action is shown below:
|
||||
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat.png" width=80%; />
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat.png" width=100%; />
|
||||
|
||||
|
||||
|
||||
|
|
@ -38,7 +38,13 @@ pip install -r requirements_cpu_only.txt
|
|||
|
||||
### Set Environment Variables
|
||||
Configure oneAPI variables by running the following command in **Anaconda Prompt**:
|
||||
> Note: For more details about runtime configurations, refer to [this guide](https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration):
|
||||
|
||||
```eval_rst
|
||||
.. note::
|
||||
|
||||
For more details about runtime configurations, `refer to this guide <https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration>`_
|
||||
```
|
||||
|
||||
```cmd
|
||||
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
|
||||
```
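
The runtime configuration used across these quickstarts (the oneAPI `setvars.bat` call plus the iGPU-only variables) can be summarized as data. The sketch below is illustrative only: the device labels `igpu` and `arc_a770` and the function `cmd_lines_for` are hypothetical, the variable values are the ones listed in the GPU install quickstart, and other dGPU series are not covered (see the runtime configuration guide for those).

```python
from typing import Dict, List

# Variable values taken from the GPU install quickstart;
# the device keys are hypothetical labels chosen for this sketch.
DEVICE_ENV: Dict[str, Dict[str, str]] = {
    "igpu": {"SYCL_CACHE_PERSISTENT": "1", "BIGDL_LLM_XMX_DISABLED": "1"},
    "arc_a770": {},  # no further variables needed according to the guide
}

# Default oneAPI install location used in this guide.
SETVARS_CALL = r'call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"'


def cmd_lines_for(device: str) -> List[str]:
    """Return the Windows cmd lines to run for `device`, oneAPI setup first."""
    if device not in DEVICE_ENV:
        raise ValueError(f"unknown device label: {device!r}")
    lines = [SETVARS_CALL]
    lines += [f"set {name}={value}" for name, value in DEVICE_ENV[device].items()]
    return lines


if __name__ == "__main__":
    print("\n".join(cmd_lines_for("igpu")))
```

The printed lines are meant to be pasted into the **Anaconda Prompt**; the script itself does not change the environment of the prompt that launched it.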
|
||||
|
|
@ -50,7 +56,13 @@ set BIGDL_LLM_XMX_DISABLED=1
|
|||
|
||||
### Launch the Server
|
||||
In **Anaconda Prompt** with the conda environment `llm` activated, navigate to the text-generation-webui folder and start the server using the following command:
|
||||
> Note: with `--load-in-4bit` option, the models will be optimized and run at 4-bit precision. For configuration for other formats and precisions, refer to [this link](https://github.com/intel-analytics/text-generation-webui?tab=readme-ov-file#32-optimizations-for-other-percisions).
|
||||
|
||||
```eval_rst
|
||||
.. note::
|
||||
|
||||
With the ``--load-in-4bit`` option, the models will be optimized and run at 4-bit precision. For configuration of other formats and precisions, refer to `this link <https://github.com/intel-analytics/text-generation-webui?tab=readme-ov-file#32-optimizations-for-other-percisions>`_.
|
||||
```
|
||||
|
||||
```cmd
|
||||
python server.py --load-in-4bit
|
||||
```
|
||||
|
|
@ -60,7 +72,7 @@ Upon successful launch, URLs to access the WebUI will be displayed in the termin
|
|||
<!-- ```cmd
|
||||
Running on local URL: http://127.0.0.1:7860
|
||||
``` -->
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_launch_server.png" width=80%; />
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_launch_server.png" width=100%; />
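
To confirm from a script that the server is accepting connections, a plain TCP check is usually enough. This is a minimal sketch using only the Python standard library; `is_port_open` is a hypothetical helper, and the default address `127.0.0.1:7860` is the local URL shown in this guide's example, so use whatever URL your terminal actually prints.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print("WebUI reachable:", is_port_open())
```

A successful connection only shows that something is listening on that port; open the URL in a browser to confirm it is the WebUI.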
|
||||
|
||||
|
||||
## 4. Using the WebUI
|
||||
|
|
@ -69,11 +81,11 @@ Upon successful launch, URLs to access the WebUI will be displayed in the termin
|
|||
|
||||
Place Hugging Face models in `C:\text-generation-webui\models` by either copying locally or downloading via the WebUI. To download, navigate to the **Model** tab, enter the model's Hugging Face ID (for instance, `Qwen/Qwen-7B-Chat`) in the **Download model or LoRA** section, and click **Download**, as illustrated below.
|
||||
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_download_model.png" width=80%; />
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_download_model.png" width=100%; />
|
||||
|
||||
After copying or downloading the models, click on the blue **refresh** button to update the **Model** drop-down menu. Then, choose your desired model from the newly updated list.
|
||||
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_select_model.png" width=80%; />
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_select_model.png" width=100%; />
|
||||
|
||||
|
||||
### Load Model
|
||||
|
|
@ -82,7 +94,7 @@ Default settings are recommended for most users. Click **Load** to activate the
|
|||
|
||||
If everything goes well, you will get a message as shown below.
|
||||
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_load_model_success.png" width=80%; />
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_load_model_success.png" width=100%; />
|
||||
|
||||
|
||||
|
||||
|
|
@ -92,7 +104,7 @@ In the **Chat** tab, start new conversations with **New chat**.
|
|||
|
||||
Enter prompts into the textbox at the bottom and press the **Generate** button to receive responses.
|
||||
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat.png" width=80%; />
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat.png" width=100%; />
|
||||
|
||||
<!-- Notes:
|
||||
* Multi-turn conversations may consume GPU memory. You may specify the `Truncate the prompt up to this length` value in `Parameters` tab to reduce the GPU memory usage.
|
||||
|
|
@ -122,4 +134,4 @@ If there are still errors on missing packages, repeat the installation process f
|
|||
### Compatibility issues
|
||||
If you encounter **AttributeError** errors like the one shown below, the model may be outdated and incompatible with the current version of the transformers package. In such cases, using a more recent model is recommended.
|
||||
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_load_model_error.png" width=80%; />
|
||||
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_load_model_error.png" width=100%; />
|
||||
|
|
|
|||