This tutorial provides a step-by-step guide on using Text-Generation-WebUI to run Hugging Face transformers-based applications with BigDL-LLM. This WebUI is ported from the upstream Text-Generation-WebUI project.
## 1. Prepare the environment on Windows

Please use a Python environment management tool (we recommend Conda) to create a Python environment and install the necessary libraries.
### 1.1 Install BigDL-LLM

Please see BigDL-LLM Installation on Windows for more details on installing BigDL-LLM on your client machine.
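For reference, a typical CPU installation looks like the sketch below; check the installation guide for the exact command matching your hardware (for example, a different package variant is documented for Intel GPUs):

```bash
# Typical CPU installation of BigDL-LLM (per the BigDL-LLM installation guide;
# consult the guide for the GPU/XPU variant if you have an Intel GPU)
pip install --pre --upgrade bigdl-llm[all]
```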
### 1.2 Install other required dependencies

```bash
pip install -r requirements.txt gradio==3.50.*
```
Note: Text-Generation-WebUI requires transformers version >= 4.36.0
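You can confirm that the installed transformers version satisfies this requirement with:

```bash
# Print the installed transformers version (should be >= 4.36.0)
python -c "import transformers; print(transformers.__version__)"
```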
## 2. Start the WebUI Server

### 2.1 For INT4 Optimizations

For a quick start, you may run the script below to start the WebUI directly; it will automatically optimize and accelerate LLMs using INT4 optimizations.

```bash
python server.py
```
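If you need to reach the server from another machine, upstream Text-Generation-WebUI also provides `--listen` and `--listen-port` flags; assuming this port retains them, a sketch:

```bash
# Expose the WebUI on all network interfaces at a custom port
# (flags from upstream Text-Generation-WebUI; assumed retained in this port)
python server.py --listen --listen-port 7861
```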
### 2.2 Optimizations for Other Precisions

To enable optimizations for more precisions (sym_int4, asym_int4, sym_int8, fp4, fp8, fp16, mixed_fp4, mixed_fp8, etc.), you may run the command below:

```bash
python server.py --load-in-low-bit
```
### 2.3 Access the WebUI

After the WebUI server starts successfully, it prints the access links shown below. Please open the public URL in your browser to access the full functionality of the WebUI.

```
Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://your_tokens_here.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
```
## 3. Run Models

### 3.1 Select the Model

First, place your local model in the `Text-Generation-WebUI/models` directory; you may also choose to download a model from Hugging Face, as shown below.
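If you download from Hugging Face, one option is the `huggingface-cli` tool from the `huggingface_hub` package; the model name below is only an example:

```bash
# Download an example model into the models directory
# (requires: pip install huggingface_hub; any transformers-format model works)
huggingface-cli download facebook/opt-1.3b --local-dir models/opt-1.3b
```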
Next, please click the Model button to choose your model.
### 3.2 Enable BigDL-LLM Optimizations

Text-Generation-WebUI supports multiple backends, including Transformers, llama.cpp, BigDL-LLM, etc. Please select the BigDL-LLM backend as shown below to enable low-bit optimizations.

Then select the device type that matches your hardware.
### 3.3 Load Model in Low Precision
One common use case of BigDL-LLM is to load a Hugging Face transformers model in low precision.
Notes:

- When you start the web UI with `--load-in-4bit`, you will not be allowed to choose the quantization precision in `load-in-low-bit`; the model will be loaded with INT4 precision by default.
- When you want to load the model in other precisions, please run `server.py` with the `--load-in-low-bit` parameter. You may choose the precision from the list in the `load-in-low-bit` option, and the `load-in-4bit` option will be disabled.
- Please select the `optimize-model` and `use_cache` options to accelerate the model.
Now you may click the Load button to load the model with BigDL-LLM optimizations.
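For reference, loading a Hugging Face model in low precision through BigDL-LLM's transformers-style Python API looks roughly like the sketch below. The model path and precision are illustrative, and this is a simplified sketch rather than the WebUI's exact internal code:

```python
# Minimal sketch of low-precision loading with BigDL-LLM's transformers-style API.
# The model path below is illustrative; use any transformers-format model directory.
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "models/Llama-2-7b-chat-hf"

# load_in_4bit=True applies the default INT4 optimization; alternatively,
# pass load_in_low_bit="sym_int8" (or fp4, fp8, ...) for other precisions.
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quick smoke test
inputs = tokenizer("What is AI?", return_tensors="pt")
output = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```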
### 3.4 Run the Model on WebUI

Now you can run model inference on Text-Generation-WebUI with BigDL-LLM optimizations, using the Chat, Default and Notebook tabs. Please see the Chat-Tab Wiki and the Default and Notebook Tabs Wiki for more details.