diff --git a/docs/readthedocs/source/_toc.yml b/docs/readthedocs/source/_toc.yml
index 094ebb4d..b4c659c3 100644
--- a/docs/readthedocs/source/_toc.yml
+++ b/docs/readthedocs/source/_toc.yml
@@ -47,6 +47,10 @@ subtrees:
             - file: doc/LLM/Overview/KeyFeatures/langchain_api
             # - file: doc/LLM/Overview/KeyFeatures/cli
             - file: doc/LLM/Overview/KeyFeatures/gpu_supports
+              subtrees:
+                - entries:
+                  - file: doc/LLM/Overview/KeyFeatures/inference_on_gpu
+                  - file: doc/LLM/Overview/KeyFeatures/finetune
             - file: doc/LLM/Overview/examples
               title: "Examples"
               subtrees:
diff --git a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/finetune.md b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/finetune.md
new file mode 100644
index 00000000..69c926eb
--- /dev/null
+++ b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/finetune.md
@@ -0,0 +1,63 @@
+# Finetune (QLoRA)
+
+We also support finetuning LLMs (large language models) using QLoRA with BigDL-LLM 4-bit optimizations on Intel GPUs.
+
+```eval_rst
+.. note::
+
+   Currently, only Hugging Face Transformers models are supported for QLoRA finetuning.
+```
+
+To help you better understand the finetuning process, here we use the model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.
+
+**Make sure you have prepared the environment following the instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
+
+```python
+import intel_extension_for_pytorch as ipex
+```
+
+First, load the model using the `transformers`-style API and **move it to the Intel GPU with `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.
+
+```python
+import torch
+import intel_extension_for_pytorch as ipex
+from bigdl.llm.transformers import AutoModelForCausalLM
+
+model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
+                                             load_in_low_bit="nf4",
+                                             optimize_model=False,
+                                             torch_dtype=torch.float16,
+                                             modules_to_not_convert=["lm_head"])
+model = model.to('xpu')
+```
+
+Then, we have to apply some preprocessing to the model to prepare it for training.
+
+```python
+from bigdl.llm.transformers.qlora import prepare_model_for_kbit_training
+
+model.gradient_checkpointing_enable()
+model = prepare_model_for_kbit_training(model)
+```
+
+Next, we can obtain a Peft model from the optimized model and a configuration object containing the parameters as follows:
+
+```python
+from bigdl.llm.transformers.qlora import get_peft_model
+from peft import LoraConfig
+
+config = LoraConfig(r=8,
+                    lora_alpha=32,
+                    target_modules=["q_proj", "k_proj", "v_proj"],
+                    lora_dropout=0.05,
+                    bias="none",
+                    task_type="CAUSAL_LM")
+model = get_peft_model(model, config)
+```
+
+```eval_rst
+.. important::
+
+   Instead of ``from peft import prepare_model_for_kbit_training, get_peft_model`` as we would do for regular QLoRA with bitsandbytes and CUDA, we import them from ``bigdl.llm.transformers.qlora`` here to get a BigDL-LLM compatible Peft model. The rest of the process is the same as a regular LoRA finetuning workflow using ``peft``.
+```
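+
+After obtaining the Peft model, you could train it with the standard `peft`/`transformers` training workflow, for example using `transformers.Trainer`. The snippet below is only a minimal, illustrative sketch and is not taken from the BigDL-LLM examples: the tokenized `train_dataset`, the `tokenizer`, and the hyperparameter values are assumed placeholders that you would replace with your own.
+
+```python
+import transformers
+
+model.config.use_cache = False  # disable the KV cache during training; re-enable it for inference
+
+trainer = transformers.Trainer(
+    model=model,
+    train_dataset=train_dataset,  # assumption: a tokenized dataset you have prepared beforehand
+    args=transformers.TrainingArguments(
+        per_device_train_batch_size=4,
+        gradient_accumulation_steps=4,
+        warmup_steps=20,
+        max_steps=200,
+        learning_rate=2e-4,
+        logging_steps=20,
+        output_dir="outputs",
+    ),
+    # assumption: `tokenizer` is the tokenizer matching the loaded model
+    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
+)
+trainer.train()
+```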
+
+```eval_rst
+.. seealso::
+
+   See the complete examples `here `_
+```
diff --git a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/gpu_supports.md b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/gpu_supports.md
deleted file mode 100644
index 7cbd65c3..00000000
--- a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/gpu_supports.md
+++ /dev/null
@@ -1,47 +0,0 @@
-# GPU Supports
-
-You may apply INT4 optimizations to any Hugging Face *Transformers* models on device with Intel GPUs as follows:
-
-```python
-# import ipex
-import intel_extension_for_pytorch as ipex
-
-# load Hugging Face Transformers model with INT4 optimizations on Intel GPUs
-from bigdl.llm.transformers import AutoModelForCausalLM
-
-model = AutoModelForCausalLM.from_pretrained('/path/to/model/',
-                                             load_in_4bit=True,
-                                             optimize_model=False)
-model = model.to('xpu')
-```
-
-```eval_rst
-.. note::
-
-   You may apply INT8 optimizations as follows:
-
-   .. code-block:: python
-
-      model = AutoModelForCausalLM.from_pretrained('/path/to/model/',
-                                                   load_in_low_bit="sym_int8",
-                                                   optimize_model=False)
-      model = model.to('xpu')
-```
-
-After loading the Hugging Face *Transformers* model, you may easily run the optimized model as follows:
-
-```python
-# run the optimized model
-from transformers import AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained(model_path)
-input_ids = tokenizer.encode(input_str, ...).to('xpu')
-output_ids = model.generate(input_ids, ...)
-output = tokenizer.batch_decode(output_ids)
-```
-
-```eval_rst
-.. seealso::
-
-   See the complete examples `here `_
-```
\ No newline at end of file
diff --git a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/gpu_supports.rst b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/gpu_supports.rst
new file mode 100644
index 00000000..af0e6e6a
--- /dev/null
+++ b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/gpu_supports.rst
@@ -0,0 +1,10 @@
+GPU Supports
+================================
+
+BigDL-LLM not only supports running large language models for inference, but also supports QLoRA finetuning on Intel GPUs.
+
+* |inference_on_gpu|_
+* `Finetune (QLoRA) <./finetune.html>`_
+
+.. |inference_on_gpu| replace:: Inference on GPU
+.. _inference_on_gpu: ./inference_on_gpu.html
diff --git a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/index.rst b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/index.rst
index 823df5a1..94eb6806 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/index.rst
+++ b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/index.rst
@@ -10,10 +10,19 @@ You may run the LLMs using ``bigdl-llm`` through one of the following APIs:
 * `Native Format <./native_format.html>`_
 * `LangChain API <./langchain_api.html>`_
-* `GPU Supports <./gpu_supports.html>`_
+* |gpu_supports|_
+
+  * |inference_on_gpu|_
+  * `Finetune (QLoRA) <./finetune.html>`_
 
 .. |transformers_style_api| replace:: ``transformers``-style API
 .. _transformers_style_api: ./transformers_style_api.html
 
 .. |hugging_face_transformers_format| replace:: Hugging Face ``transformers`` Format
-.. _hugging_face_transformers_format: ./hugging_face_format.html
\ No newline at end of file
+.. _hugging_face_transformers_format: ./hugging_face_format.html
+
+.. |gpu_supports| replace:: GPU Supports
+.. _gpu_supports: ./gpu_supports.html
+
+.. |inference_on_gpu| replace:: Inference on GPU
+.. _inference_on_gpu: ./inference_on_gpu.html
diff --git a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/inference_on_gpu.md b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/inference_on_gpu.md
new file mode 100644
index 00000000..8ca89d8c
--- /dev/null
+++ b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/inference_on_gpu.md
@@ -0,0 +1,109 @@
+# Inference on GPU
+
+Apart from the significant acceleration capabilities on Intel CPUs, BigDL-LLM also supports optimizations and acceleration for running LLMs (large language models) on Intel GPUs. With BigDL-LLM, PyTorch models (in FP16/BF16/FP32) can be optimized with low-bit quantizations (supported precisions include INT4, INT5, INT8, etc.).
+
+Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.
+
+**Make sure you have prepared the environment following the instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
+
+```python
+import intel_extension_for_pytorch as ipex
+```
+
+## Load and Optimize Model
+
+You could choose to use the [PyTorch API](./optimize_model.html) or the [`transformers`-style API](./transformers_style_api.html) on Intel GPUs according to your preference.
+
+**Once you have obtained the model with BigDL-LLM low-bit optimizations, move it to the Intel GPU with `to('xpu')`**.
+
+```eval_rst
+.. tabs::
+
+   .. tab:: PyTorch API
+
+      You could optimize any PyTorch model with a "one-line code change", and the loading and optimizing process on Intel GPUs may look as follows:
+
+      .. code-block:: python
+
+         # Take Llama-2-7b-chat-hf as an example
+         import intel_extension_for_pytorch as ipex
+         from transformers import LlamaForCausalLM
+         from bigdl.llm import optimize_model
+
+         model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True)
+         model = optimize_model(model) # With only one line to enable BigDL-LLM INT4 optimization
+
+         model = model.to('xpu') # Important after obtaining the optimized model
+
+      In particular, if you have saved the optimized model following the steps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs may look as follows:
+
+      .. code-block:: python
+
+         import intel_extension_for_pytorch as ipex
+         from transformers import LlamaForCausalLM
+         from bigdl.llm.optimize import low_memory_init, load_low_bit
+
+         saved_dir='./llama-2-bigdl-llm-4-bit'
+         with low_memory_init(): # Fast and low cost by loading model on meta device
+            model = LlamaForCausalLM.from_pretrained(saved_dir,
+                                                     torch_dtype="auto",
+                                                     trust_remote_code=True)
+         model = load_low_bit(model, saved_dir) # Load the optimized model
+
+         model = model.to('xpu') # Important after obtaining the optimized model
+
+   .. tab:: ``transformers``-style API
+
+      You could run any Hugging Face Transformers model with the ``transformers``-style API, and the loading and optimizing process on Intel GPUs may look as follows:
+
+      .. code-block:: python
+
+         # Take Llama-2-7b-chat-hf as an example
+         import intel_extension_for_pytorch as ipex
+         from bigdl.llm.transformers import AutoModelForCausalLM
+
+         # Load the model in 4 bit, which converts the relevant layers in the model into INT4 format
+         model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', load_in_4bit=True)
+
+         model = model.to('xpu') # Important after obtaining the optimized model
+
+      In particular, if you have saved the optimized model following the steps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs may look as follows:
+
+      .. code-block:: python
+
+         import intel_extension_for_pytorch as ipex
+         from bigdl.llm.transformers import AutoModelForCausalLM
+
+         saved_dir='./llama-2-bigdl-llm-4-bit'
+         model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model
+
+         model = model.to('xpu') # Important after obtaining the optimized model
+```
+
+## Run Optimized Model
+
+You could then run inference with the optimized model on Intel GPUs in almost the same way as on CPUs. **The only difference is that the input tensors also need to be moved to the GPU with `to('xpu')`.**
+
+Continuing with the [Llama-2-7b-chat-hf example](#load-and-optimize-model) above, you could run inference as follows:
+
+```python
+import torch
+from transformers import LlamaTokenizer
+
+# Load the tokenizer that matches the model
+tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
+
+with torch.inference_mode():
+    prompt = 'Q: What is CPU?\nA:'
+    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') # With .to('xpu') specifically for inference on Intel GPUs
+    output = model.generate(input_ids, max_new_tokens=32)
+    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
+```
+
+```eval_rst
+.. note::
+
+   The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation.
+```
+
+```eval_rst
+.. seealso::
+
+   See the complete examples `here `_
+```
diff --git a/docs/readthedocs/source/doc/LLM/Overview/install_gpu.md b/docs/readthedocs/source/doc/LLM/Overview/install_gpu.md
index 0d36c39f..b2ff5ee0 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/install_gpu.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/install_gpu.md
@@ -26,6 +26,7 @@ BigDL-LLM for GPU supports has been verified on:
 * Intel Arc™ A-Series Graphics
 * Intel Data Center GPU Flex Series
+* Intel Data Center GPU Max Series
 
 ```eval_rst
 .. note::