* Change order of LLM in header * Some updates to footer * Add BigDL-LLM index page and basic file structure * Update index page for key features * Add initial content for BigDL-LLM in 5 mins * Improvement to footnote * Add initial contents based on current contents we have * Add initial quick links * Small fix * Rename file * Hide cli section for now and change model supports to examples * Hugging Face format -> Hugging Face transformers format * Add placeholder for GPU supports * Add GPU related content structure * Add cpu/gpu installation initial contents * Add initial contents for GPU supports * Add image link to LLM index page * Hide tips and known issues for now * Small fix * Update based on comments * Small fix * Add notes for Python 3.9 * Add placehoder optimize model & reveal CLI; small revision * examples add gpu part * Hide CLI part again for first version of merging * add keyfeatures-optimize_model part (#1) * change gif link to the ones hosted on github * Small fix --------- Co-authored-by: plusbang <binbin1.deng@intel.com> Co-authored-by: binbin Deng <108676127+plusbang@users.noreply.github.com>
		
			
				
	
	
	
	
		
			3.1 KiB
		
	
	
	
	
	
	
	
			
		
		
	
	BigDL-LLM in 5 minutes
You can use BigDL-LLM to run any Hugging Face Transformers PyTorch model. It automatically optimizes and accelerates LLMs using low-precision (INT4/INT5/INT8) techniques, modern hardware accelerations and latest software optimizations.
Hugging Face transformers-based applications can run on BigDL-LLM with one-line code change, and you'll immediately observe significant speedup[1].
Here, let's take a relatively small LLM model, i.e open_llama_3b_v2, and BigDL-LLM INT4 optimizations as an example.
Load a Pretrained Model
Simply use one-line transformers-style API in bigdl-llm to load open_llama_3b_v2 with INT4 optimization (by specifying load_in_4bit=True) as follows:
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b_v2",
                                             load_in_4bit=True)
.. tip::
   `open_llama_3b_v2 <https://huggingface.co/openlm-research/open_llama_3b_v2>`_ is a pretrained large language model hosted on Hugging Face. ``openlm-research/open_llama_3b_v2`` is its Hugging Face model id. ``from_pretrained`` will automatically download the model from Hugging Face to a local cache path (e.g. ``~/.cache/huggingface``), load the model, and converted it to ``bigdl-llm`` INT4 format.
   It may take a long time to download the model using API. You can also download the model yourself, and set ``pretrained_model_name_or_path`` to the local path of the downloaded model. This way, ``from_pretrained`` will load and convert directly from local path without download.
Load Tokenizer
You also need a tokenizer for inference. Just use the official transformers API to load LlamaTokenizer:
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b_v2")
Run LLM
Now you can do model inference exactly the same way as using official transformers API:
import torch
with torch.inference_mode():
    prompt = 'Q: What is CPU?\nA:'
    
    # tokenize the input prompt from string to token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # predict the next tokens (maximum 32) based on the input token ids
    output = model.generate(input_ids,
                            max_new_tokens=32)
    # decode the predicted token ids to output string
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    
    print(output_str)
        [1]
            Performance varies by use, configuration and other factors. bigdl-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.