LLM: improve gpu supports key feature doc page (#9212)

Parent: 9dc76f19c0
Commit: 7e96d3e79a

7 changed files with 198 additions and 49 deletions

@@ -47,6 +47,10 @@ subtrees:
                - file: doc/LLM/Overview/KeyFeatures/langchain_api
                # - file: doc/LLM/Overview/KeyFeatures/cli
                - file: doc/LLM/Overview/KeyFeatures/gpu_supports
+                  subtrees:
+                    - entries:
+                      - file: doc/LLM/Overview/KeyFeatures/inference_on_gpu
+                      - file: doc/LLM/Overview/KeyFeatures/finetune
          - file: doc/LLM/Overview/examples
            title: "Examples"
            subtrees:

@@ -0,0 +1,63 @@
# Finetune (QLoRA)

We also support finetuning LLMs (large language models) using QLoRA with BigDL-LLM 4-bit optimizations on Intel GPUs.

```eval_rst
.. note::

   Currently, only Hugging Face Transformers models are supported for QLoRA finetuning.
```

To help you better understand the finetuning process, here we use the model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.

**Make sure you have prepared your environment by following the instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:

```python
import intel_extension_for_pytorch as ipex
```

First, load the model using the `transformers`-style API and **move it to `'xpu'` with `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.

```python
import torch
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4",
                                             optimize_model=False,
                                             torch_dtype=torch.float16,
                                             modules_to_not_convert=["lm_head"])
model = model.to('xpu')
```

Then, we need to apply some preprocessing to the model to prepare it for training.

```python
from bigdl.llm.transformers.qlora import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
```

Next, we can obtain a PEFT model from the optimized model and a configuration object containing the parameters as follows:

```python
from bigdl.llm.transformers.qlora import get_peft_model
from peft import LoraConfig

config = LoraConfig(r=8,
                    lora_alpha=32,
                    target_modules=["q_proj", "k_proj", "v_proj"],
                    lora_dropout=0.05,
                    bias="none",
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
```

```eval_rst
.. important::

   Instead of ``from peft import prepare_model_for_kbit_training, get_peft_model`` as we would do for regular QLoRA with bitsandbytes and CUDA, we import them from ``bigdl.llm.transformers.qlora`` here to get a BigDL-LLM compatible PEFT model. The rest is just the same as the regular LoRA finetuning process using ``peft``, as sketched below.
```
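
For illustration only, the remaining steps might look like the following minimal sketch of a standard `peft`/`transformers` training loop. The `tokenizer`, `tokenized_dataset` and the hyperparameter values below are hypothetical placeholders, not part of this example; adapt them to your own data.

```python
import transformers

# A minimal, illustrative training setup; dataset and hyperparameters are placeholders.
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_dataset,  # hypothetical pre-tokenized training split
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=20,
        max_steps=200,
        learning_rate=2e-4,
        output_dir="outputs",  # hypothetical output directory
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence warnings during training; re-enable for inference
trainer.train()
```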

```eval_rst
.. seealso::

   See the complete examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU>`_
```

@@ -1,47 +0,0 @@
# GPU Supports

You may apply INT4 optimizations to any Hugging Face *Transformers* models on device with Intel GPUs as follows:

```python
# import ipex
import intel_extension_for_pytorch as ipex

# load Hugging Face Transformers model with INT4 optimizations on Intel GPUs
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('/path/to/model/',
                                             load_in_4bit=True,
                                             optimize_model=False)
model = model.to('xpu')
```

```eval_rst
.. note::

   You may apply INT8 optimizations as follows:

   .. code-block:: python

      model = AutoModelForCausalLM.from_pretrained('/path/to/model/',
                                                   load_in_low_bit="sym_int8",
                                                   optimize_model=False)
      model = model.to('xpu')
```

After loading the Hugging Face *Transformers* model, you may easily run the optimized model as follows:

```python
# run the optimized model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```

```eval_rst
.. seealso::

   See the complete examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/transformers/transformers_int4/GPU>`_
```

@@ -0,0 +1,10 @@
GPU Supports
================================

BigDL-LLM not only supports running large language models for inference, but also supports QLoRA finetuning on Intel GPUs.

* |inference_on_gpu|_
* `Finetune (QLoRA) <./finetune.html>`_

.. |inference_on_gpu| replace:: Inference on GPU
.. _inference_on_gpu: ./inference_on_gpu.html

@@ -10,10 +10,19 @@ You may run the LLMs using ``bigdl-llm`` through one of the following APIs:
  * `Native Format <./native_format.html>`_

* `LangChain API <./langchain_api.html>`_
-* `GPU Supports <./gpu_supports.html>`_
+* |gpu_supports|_
+
+  * |inference_on_gpu|_
+  * `Finetune (QLoRA) <./finetune.html>`_

.. |transformers_style_api| replace:: ``transformers``-style API
.. _transformers_style_api: ./transformers_style_api.html

.. |hugging_face_transformers_format| replace:: Hugging Face ``transformers`` Format
.. _hugging_face_transformers_format: ./hugging_face_format.html
+
+.. |gpu_supports| replace:: GPU Supports
+.. _gpu_supports: ./gpu_supports.html
+
+.. |inference_on_gpu| replace:: Inference on GPU
+.. _inference_on_gpu: ./inference_on_gpu.html

@@ -0,0 +1,109 @@
# Inference on GPU

Apart from its significant acceleration capabilities on Intel CPUs, BigDL-LLM also supports optimizations and acceleration for running LLMs (large language models) on Intel GPUs. With BigDL-LLM, PyTorch models (in FP16/BF16/FP32) can be optimized with low-bit quantizations (supported precisions include INT4, INT5, INT8, etc.).
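
As a quick illustration (a minimal sketch, not part of the original walkthrough), the precision is chosen when loading the model; here we assume the `transformers`-style API covered later on this page, with `load_in_low_bit="sym_int8"` (symmetric INT8) as used elsewhere in these docs:

```python
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM

# Illustrative only: request symmetric INT8 instead of the default INT4 optimization
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf',
                                             load_in_low_bit="sym_int8",
                                             optimize_model=False)
model = model.to('xpu')  # move the optimized model to the Intel GPU
```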

Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model, [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), as an example.

**Make sure you have prepared your environment by following the instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:

```python
import intel_extension_for_pytorch as ipex
```

## Load and Optimize Model

You could choose to use the [PyTorch API](./optimize_model.html) or the [`transformers`-style API](./transformers_style_api.html) on Intel GPUs, according to your preference.

**Once you have the model with BigDL-LLM low-bit optimization, move it to the GPU with `to('xpu')`**.

```eval_rst
.. tabs::

   .. tab:: PyTorch API

      You could optimize any PyTorch model with a "one-line code change", and the loading and optimizing process on Intel GPUs may be as follows:

      .. code-block:: python

         # Take Llama-2-7b-chat-hf as an example
         import intel_extension_for_pytorch as ipex
         from transformers import LlamaForCausalLM
         from bigdl.llm import optimize_model

         model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True)
         model = optimize_model(model) # Only one line is needed to enable BigDL-LLM INT4 optimization

         model = model.to('xpu') # Important after obtaining the optimized model

      In particular, if you have saved the optimized model following the steps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs may be as follows:

      .. code-block:: python

         import intel_extension_for_pytorch as ipex
         from transformers import LlamaForCausalLM
         from bigdl.llm.optimize import low_memory_init, load_low_bit

         saved_dir='./llama-2-bigdl-llm-4-bit'
         with low_memory_init(): # Fast and low-cost: load the model on the meta device
            model = LlamaForCausalLM.from_pretrained(saved_dir,
                                                     torch_dtype="auto",
                                                     trust_remote_code=True)
         model = load_low_bit(model, saved_dir) # Load the optimized model

         model = model.to('xpu') # Important after obtaining the optimized model

   .. tab:: ``transformers``-style API

      You could run any Hugging Face Transformers model with the ``transformers``-style API, and the loading and optimizing process on Intel GPUs may be as follows:

      .. code-block:: python

         # Take Llama-2-7b-chat-hf as an example
         import intel_extension_for_pytorch as ipex
         from bigdl.llm.transformers import AutoModelForCausalLM

         # Load the model in 4 bit, which converts the relevant layers in the model into INT4 format
         model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', load_in_4bit=True)

         model = model.to('xpu') # Important after obtaining the optimized model

      In particular, if you have saved the optimized model following the steps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs may be as follows:

      .. code-block:: python

         import intel_extension_for_pytorch as ipex
         from bigdl.llm.transformers import AutoModelForCausalLM

         saved_dir='./llama-2-bigdl-llm-4-bit'
         model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model

         model = model.to('xpu') # Important after obtaining the optimized model
```

## Run Optimized Model

You could then run inference with the optimized model on Intel GPUs in almost the same way as on CPUs. **The only difference is that the input tensors need to be moved to `'xpu'` with `to('xpu')`.**

Continuing with the [Llama-2-7b-chat-hf example](#load-and-optimize-model) above, inference can be run as follows:

```python
import torch
from transformers import AutoTokenizer

# Load the tokenizer for the model used above
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')

with torch.inference_mode():
   prompt = 'Q: What is CPU?\nA:'
   input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') # Move the input tensors to the Intel GPU with .to('xpu')
   output = model.generate(input_ids, max_new_tokens=32)
   output_str = tokenizer.decode(output[0], skip_special_tokens=True)
```

```eval_rst
.. note::

   The initial generation of an optimized LLM on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation, for example as sketched below.
```
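
A possible warm-up, as a minimal sketch (not part of the original example), reusing the `tokenizer` and `model` from above; the prompt and token count are arbitrary placeholders:

```python
import torch

with torch.inference_mode():
    # Warm-up run: a short generation whose output is discarded, so one-time
    # startup costs on the Intel GPU are not paid during the real generation
    warmup_ids = tokenizer.encode('Hello', return_tensors="pt").to('xpu')
    _ = model.generate(warmup_ids, max_new_tokens=1)

    # ... then perform the actual generation as shown in the example above
```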

```eval_rst
.. seealso::

   See the complete examples `here <https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU>`_
```

@@ -26,6 +26,7 @@ BigDL-LLM for GPU supports has been verified on:

* Intel Arc™ A-Series Graphics
* Intel Data Center GPU Flex Series
+* Intel Data Center GPU Max Series

```eval_rst
.. note::