ipex-llm/docs/readthedocs/source/doc/LLM/Overview/llm.md
Yuwen Hu cf6a620bae [LLM] BigDL-LLM Documentation Initial Version (#8833)
Co-authored-by: plusbang <binbin1.deng@intel.com>
Co-authored-by: binbin Deng <108676127+plusbang@users.noreply.github.com>
2023-09-06 15:38:45 +08:00


# BigDL-LLM in 5 minutes
You can use BigDL-LLM to run any [*Hugging Face Transformers*](https://huggingface.co/docs/transformers/index) PyTorch model. It automatically optimizes and accelerates LLMs using low-precision (INT4/INT5/INT8) techniques, modern hardware acceleration and the latest software optimizations.
Hugging Face transformers-based applications can run on BigDL-LLM with a one-line code change, and you'll immediately observe a significant speedup<sup><a href="#footnote-perf" id="ref-perf">[1]</a></sup>.
Here, let's take a relatively small LLM, i.e. [open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2), with BigDL-LLM INT4 optimizations as an example.
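To give a rough intuition for what "low-precision INT4" means, the sketch below quantizes a handful of floating-point weights to 4-bit integer codes with a single shared scale, then restores them. It is only an illustration of the general idea in plain Python; the function names and sample weights are made up for this example, and it is not the actual `bigdl-llm` INT4 format or kernel.

```python
def quantize_int4(values):
    """Symmetric 4-bit quantization of one block of floats (illustrative only).

    Returns (codes, scale), where each code is an integer in [-8, 7].
    """
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 7; fall back to 1.0 for an all-zero block
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int4(codes, scale):
    """Recover approximate floats from the 4-bit codes and the shared scale."""
    return [c * scale for c in codes]


# Hypothetical sample weights, chosen only for demonstration
weights = [0.12, -0.53, 0.98, -1.40, 0.07, 0.66]
codes, scale = quantize_int4(weights)
restored = dequantize_int4(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored in 4 bits instead of 16 or 32, at the cost of a rounding error of at most half the scale per weight. Real implementations typically apply this per small block of weights and pack the codes tightly in memory.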
## Load a Pretrained Model
Use the one-line `transformers`-style API in `bigdl-llm` to load `open_llama_3b_v2` with INT4 optimization (by specifying `load_in_4bit=True`) as follows:
```python
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b_v2",
                                             load_in_4bit=True)
```
```eval_rst
.. tip::

   `open_llama_3b_v2 <https://huggingface.co/openlm-research/open_llama_3b_v2>`_ is a pretrained large language model hosted on Hugging Face. ``openlm-research/open_llama_3b_v2`` is its Hugging Face model id. ``from_pretrained`` will automatically download the model from Hugging Face to a local cache path (e.g. ``~/.cache/huggingface``), load the model, and convert it to ``bigdl-llm`` INT4 format.

   Downloading the model through the API may take a long time. You can also download the model yourself and set ``pretrained_model_name_or_path`` to the local path of the downloaded model. This way, ``from_pretrained`` will load and convert the model directly from the local path without downloading it.
```
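If you keep a local copy of the model, one possible way to switch between the local path and the Hugging Face model id is sketched below. The helper `resolve_model_path` and the directory `./open_llama_3b_v2` are hypothetical names introduced for this example; the only assumption about the downloaded model is that its directory contains a `config.json` file, as Hugging Face model directories do.

```python
import os

MODEL_ID = "openlm-research/open_llama_3b_v2"


def resolve_model_path(local_dir, model_id=MODEL_ID):
    """Hypothetical helper: prefer a local model directory when one exists.

    Returns local_dir if it contains a config.json file,
    otherwise falls back to the Hugging Face model id.
    """
    if local_dir and os.path.isfile(os.path.join(local_dir, "config.json")):
        return local_dir
    return model_id


# "./open_llama_3b_v2" is an assumed location for a manually downloaded copy
model_path = resolve_model_path("./open_llama_3b_v2")
print(model_path)

# The result can then be passed to the loading call shown earlier:
# model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_path,
#                                              load_in_4bit=True)
```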
## Load Tokenizer
You also need a tokenizer for inference. Just use the official `transformers` API to load `LlamaTokenizer`:
```python
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b_v2")
```
## Run LLM
Now you can run model inference exactly the same way as with the official `transformers` API:
```python
import torch
with torch.inference_mode():
    prompt = 'Q: What is CPU?\nA:'

    # tokenize the input prompt from string to token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # predict the next tokens (maximum 32) based on the input token ids
    output = model.generate(input_ids,
                            max_new_tokens=32)

    # decode the predicted token ids to output string
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)

    print(output_str)
```
```
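The decoded string above contains the prompt followed by the generated text. If you only want the newly generated part, one option is to drop the prompt's token ids before decoding. The sketch below wraps the same three steps in a function; `generate_completion` is a hypothetical name introduced here, and it assumes `model` and `tokenizer` are the objects loaded in the previous sections.

```python
def generate_completion(model, tokenizer, prompt, max_new_tokens=32):
    """Hypothetical helper: return only the newly generated text for a prompt."""
    # tokenize the input prompt from string to token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # predict at most max_new_tokens next tokens
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)

    # the output starts with the prompt tokens, so skip them before decoding
    new_token_ids = output[0][input_ids.shape[1]:]
    return tokenizer.decode(new_token_ids, skip_special_tokens=True)


# Usage, assuming `model` and `tokenizer` were loaded as shown earlier:
# import torch
# with torch.inference_mode():
#     print(generate_completion(model, tokenizer, 'Q: What is CPU?\nA:'))
```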
------
<div>
<p>
<sup><a href="#ref-perf" id="footnote-perf">[1]</a>
Performance varies by use, configuration and other factors. <code><span>bigdl-llm</span></code> may not optimize to the same degree for non-Intel products. Learn more at <a href="https://www.Intel.com/PerformanceIndex">www.Intel.com/PerformanceIndex</a>.
</sup>
</p>
</div>