From 1a1a97c9e403924893b0050a7420f41be70043d8 Mon Sep 17 00:00:00 2001
From: SichengStevenLi <144295301+SichengStevenLi@users.noreply.github.com>
Date: Fri, 21 Jun 2024 12:07:50 +0800
Subject: [PATCH] Update mddocs for part of Overview (2/2) and Inference (#11377)

* updated link
* converted to md format, need to be reviewed
* converted to md format, need to be reviewed
* converted to md format, need to be reviewed
* converted to md format, need to be reviewed
* converted to md format, need to be reviewed
* converted to md format, need to be reviewed
* converted to md format, need to be reviewed
* converted to md format, need to be reviewed
* converted to md format, need to be reviewed
* converted to md format, need to be reviewed, deleted some leftover texts
* converted to md file type, need to be reviewed
* converted to md file type, need to be reviewed
* testing Github Tags
* testing Github Tags
* added Github Tags
* added Github Tags
* added Github Tags
* Small fix
* Small fix
* Small fix
* Small fix
* Small fix
* Further fix
* Fix index
* Small fix
* Fix

---------

Co-authored-by: Yuwen Hu
---
 .../Inference/Self_Speculative_Decoding.md    |  24 +--
 docs/mddocs/Overview/FAQ/faq.md               |  19 +--
 docs/mddocs/Overview/KeyFeatures/cli.md       |   8 +-
 docs/mddocs/Overview/KeyFeatures/finetune.md  |  32 ++--
 .../Overview/KeyFeatures/gpu_supports.md      |   7 +
 .../Overview/KeyFeatures/gpu_supports.rst     |  14 --
 .../KeyFeatures/hugging_face_format.md        |  30 ++--
 docs/mddocs/Overview/KeyFeatures/index.md     |  13 ++
 docs/mddocs/Overview/KeyFeatures/index.rst    |  33 -----
 .../Overview/KeyFeatures/inference_on_gpu.md  | 140 ++++++++----------
 .../Overview/KeyFeatures/langchain_api.md     |  24 +--
 11 files changed, 135 insertions(+), 209 deletions(-)
 create mode 100644 docs/mddocs/Overview/KeyFeatures/gpu_supports.md
 delete mode 100644 docs/mddocs/Overview/KeyFeatures/gpu_supports.rst
 create mode 100644 docs/mddocs/Overview/KeyFeatures/index.md
 delete mode 100644 docs/mddocs/Overview/KeyFeatures/index.rst

diff --git a/docs/mddocs/Inference/Self_Speculative_Decoding.md b/docs/mddocs/Inference/Self_Speculative_Decoding.md
index 99179194..80c68fea 100644
--- a/docs/mddocs/Inference/Self_Speculative_Decoding.md
+++ b/docs/mddocs/Inference/Self_Speculative_Decoding.md
@@ -1,23 +1,23 @@
 # Self-Speculative Decoding
 
 ### Speculative Decoding in Practice
-In [speculative](https://arxiv.org/abs/2302.01318) [decoding](https://arxiv.org/abs/2211.17192), a small (draft) model quickly generates multiple draft tokens, which are then verified in parallel by the large (target) model. While speculative decoding can effectively speed up the target model, ***in practice it is difficult to maintain or even obtain a proper draft model***, especially when the target model is finetuned with customized data. 
+In [speculative](https://arxiv.org/abs/2302.01318) [decoding](https://arxiv.org/abs/2211.17192), a small (draft) model quickly generates multiple draft tokens, which are then verified in parallel by the large (target) model. While speculative decoding can effectively speed up the target model, ***in practice it is difficult to maintain or even obtain a proper draft model***, especially when the target model is finetuned with customized data.
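+
+To make the idea concrete, below is a deliberately simplified, runnable sketch of the draft-and-verify loop; it is illustrative only, with toy stand-in "models", and does not reflect the actual IPEX-LLM implementation or API:
+
+```python
+# Illustrative sketch of the draft-and-verify loop (not the IPEX-LLM implementation).
+# Toy integer "models" stand in for the real draft/target LLMs so the example runs as-is.
+def speculative_generate(target_next, draft_propose, tokens, rounds=4, k=4):
+    for _ in range(rounds):                              # each round: draft k tokens, then verify
+        for tok in draft_propose(tokens, k):
+            if tok == target_next(tokens):               # target agrees (checked in parallel in practice)
+                tokens = tokens + [tok]                  # accepted draft token is essentially free
+            else:
+                tokens = tokens + [target_next(tokens)]  # first mismatch: keep the target's token
+                break
+    return tokens
+
+target_next = lambda toks: toks[-1] + 1                               # "large" model: the true next token
+draft_propose = lambda toks, k: [toks[-1] + i + 1 for i in range(k)]  # "small" model: quick guesses
+
+print(speculative_generate(target_next, draft_propose, [0]))  # -> [0, 1, 2, ..., 16]
+```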
 
-### Self-Speculative Decoding 
+### Self-Speculative Decoding
 Built on top of the concept of “[self-speculative decoding](https://arxiv.org/abs/2309.08168)”, IPEX-LLM can now accelerate the original FP16 or BF16 model ***without the need of a separate draft model or model finetuning***; instead, it automatically converts the original model to INT4, and uses the INT4 model as the draft model behind the scene. In practice, this brings ***~30% speedup*** for FP16 and BF16 LLM inference latency on Intel GPU and CPU respectively.
 
 ### Using IPEX-LLM Self-Speculative Decoding
 Please refer to IPEX-LLM self-speculative decoding code snippets below, and the detailed [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Speculative-Decoding) and [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Speculative-Decoding) examples in the project repo.
 
-```python 
+```python
 model = AutoModelForCausalLM.from_pretrained(model_path,
-             optimize_model=True,
-             torch_dtype=torch.float16, #use bfloat16 on cpu
-             load_in_low_bit="fp16", #use bf16 on cpu
-             speculative=True, #set speculative to true
-             trust_remote_code=True,
-             use_cache=True)
+                                             optimize_model=True,
+                                             torch_dtype=torch.float16, #use bfloat16 on cpu
+                                             load_in_low_bit="fp16", #use bf16 on cpu
+                                             speculative=True, #set speculative to true
+                                             trust_remote_code=True,
+                                             use_cache=True)
 output = model.generate(input_ids,
-            max_new_tokens=args.n_predict,
-            do_sample=False)
-```
+                        max_new_tokens=args.n_predict,
+                        do_sample=False)
+```
\ No newline at end of file
diff --git a/docs/mddocs/Overview/FAQ/faq.md b/docs/mddocs/Overview/FAQ/faq.md
index caf8bd51..284cb841 100644
--- a/docs/mddocs/Overview/FAQ/faq.md
+++ b/docs/mddocs/Overview/FAQ/faq.md
@@ -5,33 +5,34 @@
 ### GGUF format usage with IPEX-LLM?
 
 IPEX-LLM supports running GGUF/AWQ/GPTQ models on both [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations) and [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations).
+
 Please also refer to [here](https://github.com/intel-analytics/ipex-llm?tab=readme-ov-file#latest-update-) for our latest support.
 
 ## How to Resolve Errors
 
-### Fail to install `ipex-llm` through `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/` or `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/` 
-
+### Fail to install `ipex-llm` through `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/` or `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/`
 
 You could try to install IPEX-LLM dependencies for Intel XPU from source archives:
-- For Windows system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#install-ipex-llm-from-wheel) for the steps.
-- For Linux system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id3) for the steps.
+- For Windows systems, refer to [here](../install_gpu.md#install-ipex-llm-from-wheel) for the steps.
+- For Linux systems, refer to [here](../install_gpu.md#prerequisites-1) for the steps.
 
 ### PyTorch is not linked with support for xpu devices
 
-1. Before running on Intel GPUs, please make sure you've prepared environment follwing [installation instruction](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html).
+1. Before running on Intel GPUs, please make sure you've prepared the environment following the [installation instructions](../install_gpu.md).
 2. If you are using an older version of `ipex-llm` (specifically, older than 2.5.0b20240104), you need to manually add `import intel_extension_for_pytorch as ipex` at the beginning of your code.
 3. After optimizing the model with IPEX-LLM, you need to move model to GPU through `model = model.to('xpu')`.
-4. If you have mutil GPUs, you could refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.html) for details about GPU selection.
+4. If you have multiple GPUs, you could refer to [here](../KeyFeatures/multi_gpus_selection.md) for details about GPU selection.
 5. If you do inference using the optimized model on Intel GPUs, you also need to set `to('xpu')` for input tensors.
 
 ### Import `intel_extension_for_pytorch` error on Windows GPU
 
-Please refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#error-loading-intel-extension-for-pytorch) for detailed guide. We list the possible missing requirements in environment which could lead to this error.
+Please refer to [here](../install_gpu.md#1-error-loading-intel_extension_for_pytorch)
+for a detailed guide. We list the possible missing requirements in your environment which could lead to this error.
 
 ### XPU device count is zero
 
 It's recommended to reinstall driver:
-- For Windows system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#prerequisites) for the steps.
-- For Linux system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id1) for the steps.
+- For Windows systems, refer to [here](../install_gpu.md#windows) for the steps.
+- For Linux systems, refer to [here](../install_gpu.md#prerequisites-1) for the steps.
 
 ### Error such as `The size of tensor a (33) must match the size of tensor b (17) at non-singleton dimension 2` duing attention forward function
diff --git a/docs/mddocs/Overview/KeyFeatures/cli.md b/docs/mddocs/Overview/KeyFeatures/cli.md
index ab162594..ba6c3919 100644
--- a/docs/mddocs/Overview/KeyFeatures/cli.md
+++ b/docs/mddocs/Overview/KeyFeatures/cli.md
@@ -1,11 +1,7 @@
 # CLI (Command Line Interface) Tool
 
-```eval_rst
-
-.. note::
-
-   Currently ``ipex-llm`` CLI supports *LLaMA* (e.g., vicuna), *GPT-NeoX* (e.g., redpajama), *BLOOM* (e.g., pheonix) and *GPT2* (e.g., starcoder) model architecture; for other models, you may use the ``transformers``-style or LangChain APIs.
-```
+> [!NOTE]
+> Currently `ipex-llm` CLI supports the *LLaMA* (e.g., vicuna), *GPT-NeoX* (e.g., redpajama), *BLOOM* (e.g., phoenix) and *GPT2* (e.g., starcoder) model architectures; for other models, you may use the `transformers`-style or LangChain APIs.
 
 ## Convert Model
diff --git a/docs/mddocs/Overview/KeyFeatures/finetune.md b/docs/mddocs/Overview/KeyFeatures/finetune.md
index b895b89f..f8dbe35c 100644
--- a/docs/mddocs/Overview/KeyFeatures/finetune.md
+++ b/docs/mddocs/Overview/KeyFeatures/finetune.md
@@ -2,21 +2,15 @@
 
 We also support finetuning LLMs (large language models) using QLoRA with IPEX-LLM 4bit optimizations on Intel GPUs.
 
-```eval_rst
-.. note::
-
-   Currently, only Hugging Face Transformers models are supported running QLoRA finetuning.
-```
+> [!NOTE]
+> Currently, only Hugging Face Transformers models are supported for QLoRA finetuning.
 
 To help you better understand the finetuning process, here we use model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.
 
-**Make sure you have prepared environment following instructions [here](../install_gpu.html).**
+**Make sure you have prepared the environment following the instructions [here](../install_gpu.md).**
 
-```eval_rst
-.. note::
-
-   If you are using an older version of ``ipex-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
-```
+> [!NOTE]
+> If you are using an older version of `ipex-llm` (specifically, older than 2.5.0b20240104), you need to manually add `import intel_extension_for_pytorch as ipex` at the beginning of your code.
 
 First, load model using `transformers`-style API and **set it to `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.
@@ -32,6 +26,7 @@ model = model.to('xpu')
 ```
 
 Then, we have to apply some preprocessing to the model to prepare it for training.
+
 ```python
 from ipex_llm.transformers.qlora import prepare_model_for_kbit_training
 model.gradient_checkpointing_enable()
@@ -39,6 +34,7 @@ model = prepare_model_for_kbit_training(model)
 ```
 
 Next, we can obtain a Peft model from the optimized model and a configuration object containing the parameters as follows:
+
 ```python
 from ipex_llm.transformers.qlora import get_peft_model
 from peft import LoraConfig
@@ -51,14 +47,8 @@ config = LoraConfig(r=8,
 model = get_peft_model(model, config)
 ```
 
-```eval_rst
-.. important::
+> [!IMPORTANT]
+> Instead of `from peft import prepare_model_for_kbit_training, get_peft_model` as we did for regular QLoRA using bitsandbytes and CUDA, we import them from `ipex_llm.transformers.qlora` here to get an IPEX-LLM compatible Peft model. The rest is just the same as the regular LoRA finetuning process using `peft`.
 
-   Instead of ``from peft import prepare_model_for_kbit_training, get_peft_model`` as we did for regular QLoRA using bitandbytes and cuda, we import them from ``ipex_llm.transformers.qlora`` here to get a IPEX-LLM compatible Peft model. And the rest is just the same as regular LoRA finetuning process using ``peft``.
-```
-
-```eval_rst
-.. seealso::
-
-   See the complete examples `here `_
-```
+> [!TIP]
+> See the complete examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU).
\ No newline at end of file
diff --git a/docs/mddocs/Overview/KeyFeatures/gpu_supports.md b/docs/mddocs/Overview/KeyFeatures/gpu_supports.md
new file mode 100644
index 00000000..a6c5da3d
--- /dev/null
+++ b/docs/mddocs/Overview/KeyFeatures/gpu_supports.md
@@ -0,0 +1,7 @@
+# GPU Supports
+
+IPEX-LLM not only supports running large language models for inference, but also supports QLoRA finetuning on Intel GPUs.
+
+* [Inference on GPU](./inference_on_gpu.md)
+* [Finetune (QLoRA)](./finetune.md)
+* [Multi GPUs selection](./multi_gpus_selection.md)
\ No newline at end of file
diff --git a/docs/mddocs/Overview/KeyFeatures/gpu_supports.rst b/docs/mddocs/Overview/KeyFeatures/gpu_supports.rst
deleted file mode 100644
index 6828cb05..00000000
--- a/docs/mddocs/Overview/KeyFeatures/gpu_supports.rst
+++ /dev/null
@@ -1,14 +0,0 @@
-GPU Supports
-================================
-
-IPEX-LLM not only supports running large language models for inference, but also supports QLoRA finetuning on Intel GPUs.
-
-* |inference_on_gpu|_
-* `Finetune (QLoRA) <./finetune.html>`_
-* `Multi GPUs selection <./multi_gpus_selection.html>`_
-
-.. |inference_on_gpu| replace:: Inference on GPU
-.. _inference_on_gpu: ./inference_on_gpu.html
-
-.. |multi_gpus_selection| replace:: Multi GPUs selection
-.. _multi_gpus_selection: ./multi_gpus_selection.html
diff --git a/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md b/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md
index 0eee498f..14d19fab 100644
--- a/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md
+++ b/docs/mddocs/Overview/KeyFeatures/hugging_face_format.md
@@ -22,21 +22,18 @@ output_ids = model.generate(input_ids, ...)
 output = tokenizer.batch_decode(output_ids)
 ```
 
-```eval_rst
-.. seealso::
+> [!TIP]
+> See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels).
 
-   See the complete CPU examples `here `_ and GPU examples `here `_.
+> [!NOTE]
+> You may apply more low bit optimizations (including INT8, INT5 and INT4) as follows:
+>
+> ```python
+> model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
+> ```
+>
+> See the CPU example [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types) and GPU example [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types).
 
-.. note::
-
-   You may apply more low bit optimizations (including INT8, INT5 and INT4) as follows:
-
-   .. code-block:: python
-
-      model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
-
-   See the CPU example `here `_ and GPU example `here `_.
-```
 
 ## Save & Load
 
 After the model is optimized using INT4 (or INT8/INT5), you may save and load the optimized model as follows:
@@ -47,8 +44,5 @@ model.save_low_bit(model_path)
 new_model = AutoModelForCausalLM.load_low_bit(model_path)
 ```
 
-```eval_rst
-.. seealso::
-
-   See the CPU example `here `_ and GPU example `here `_
-```
\ No newline at end of file
+> [!TIP]
+> See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load).
diff --git a/docs/mddocs/Overview/KeyFeatures/index.md b/docs/mddocs/Overview/KeyFeatures/index.md
new file mode 100644
index 00000000..6107ab3c
--- /dev/null
+++ b/docs/mddocs/Overview/KeyFeatures/index.md
@@ -0,0 +1,13 @@
+# IPEX-LLM Key Features
+
+You may run the LLMs using `ipex-llm` through one of the following APIs:
+
+* [PyTorch API](./optimize_model.md)
+* [`transformers`-style API](./transformers_style_api.md)
+  * [Hugging Face `transformers` Format](./hugging_face_format.md)
+  * [Native Format](./native_format.md)
+* [LangChain API](./langchain_api.md)
+* [GPU Supports](./gpu_supports.md)
+  * [Inference on GPU](./inference_on_gpu.md)
+  * [Finetune (QLoRA)](./finetune.md)
+  * [Multi GPUs selection](./multi_gpus_selection.md)
diff --git a/docs/mddocs/Overview/KeyFeatures/index.rst b/docs/mddocs/Overview/KeyFeatures/index.rst
deleted file mode 100644
index 8611f9bd..00000000
--- a/docs/mddocs/Overview/KeyFeatures/index.rst
+++ /dev/null
@@ -1,33 +0,0 @@
-IPEX-LLM Key Features
-================================
-
-You may run the LLMs using ``ipex-llm`` through one of the following APIs:
-
-* `PyTorch API <./optimize_model.html>`_
-* |transformers_style_api|_
-
-  * |hugging_face_transformers_format|_
-  * `Native Format <./native_format.html>`_
-
-* `LangChain API <./langchain_api.html>`_
-* |gpu_supports|_
-
-  * |inference_on_gpu|_
-  * `Finetune (QLoRA) <./finetune.html>`_
-  * `Multi GPUs selection <./multi_gpus_selection.html>`_
-
-
-.. |transformers_style_api| replace:: ``transformers``-style API
-.. _transformers_style_api: ./transformers_style_api.html
-
-.. |hugging_face_transformers_format| replace:: Hugging Face ``transformers`` Format
-.. _hugging_face_transformers_format: ./hugging_face_format.html
-
-.. |gpu_supports| replace:: GPU Supports
-.. _gpu_supports: ./gpu_supports.html
-
-.. |inference_on_gpu| replace:: Inference on GPU
-.. _inference_on_gpu: ./inference_on_gpu.html
-
-.. |multi_gpus_selection| replace:: Multi GPUs selection
-.. _multi_gpus_selection: ./multi_gpus_selection.html
diff --git a/docs/mddocs/Overview/KeyFeatures/inference_on_gpu.md b/docs/mddocs/Overview/KeyFeatures/inference_on_gpu.md
index 1a9638e9..126dc2af 100644
--- a/docs/mddocs/Overview/KeyFeatures/inference_on_gpu.md
+++ b/docs/mddocs/Overview/KeyFeatures/inference_on_gpu.md
@@ -4,95 +4,86 @@ Apart from the significant acceleration capabilites on Intel CPUs, IPEX-LLM also
 
 Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.
 
-**Make sure you have prepared environment following instructions [here](../install_gpu.html).**
+**Make sure you have prepared the environment following the instructions [here](../install_gpu.md).**
 
-```eval_rst
-.. note::
-
-   If you are using an older version of ``ipex-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
-```
+> [!NOTE]
+> If you are using an older version of `ipex-llm` (specifically, older than 2.5.0b20240104), you need to manually add `import intel_extension_for_pytorch as ipex` at the beginning of your code.
 
 ## Load and Optimize Model
 
-You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-style API](./transformers_style_api.html) on Intel GPUs according to your preference. 
+You could choose to use [PyTorch API](./optimize_model.md) or [`transformers`-style API](./transformers_style_api.md) on Intel GPUs according to your preference.
 
 **Once you have the model with IPEX-LLM low bit optimization, set it to `to('xpu')`**.
 
-```eval_rst
-.. tabs::
+- For **PyTorch API**:
 
-   .. tab:: PyTorch API
+  You could optimize any PyTorch model with "one-line code change", and the loading and optimizing process on Intel GPUs may be as follows:
 
-      You could optimize any PyTorch model with "one-line code change", and the loading and optimizing process on Intel GPUs maybe as follows:
-
-      .. code-block:: python
+  ```python
+  # Take Llama-2-7b-chat-hf as an example
+  from transformers import LlamaForCausalLM
+  from ipex_llm import optimize_model
 
-         # Take Llama-2-7b-chat-hf as an example
-         from transformers import LlamaForCausalLM
-         from ipex_llm import optimize_model
+  model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True)
+  model = optimize_model(model) # With only one line to enable IPEX-LLM INT4 optimization
 
-         model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True)
-         model = optimize_model(model) # With only one line to enable IPEX-LLM INT4 optimization
+  model = model.to('xpu') # Important after obtaining the optimized model
+  ```
 
-         model = model.to('xpu') # Important after obtaining the optimized model
+  > **Tip**:
+  >
+  > When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the `optimize_model` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
+  >
+  > See the [API doc](https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/optimize.html) for `optimize_model` to find more information.
 
-      .. tip::
+  In particular, if you have saved the optimized model following the steps [here](./optimize_model.md#save), the loading process on Intel GPUs may be as follows:
 
-         When running LLMs on Intel iGPUs for Windows users, we recommend setting ``cpu_embedding=True`` in the ``optimize_model`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
-
-         See the `API doc <../../../PythonAPI/LLM/optimize.html#ipex_llm.optimize_model>`_ for ``optimize_model`` to find more information.
+  ```python
+  from transformers import LlamaForCausalLM
+  from ipex_llm.optimize import low_memory_init, load_low_bit
 
-      Especially, if you have saved the optimized model following setps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs maybe as follows:
+  saved_dir='./llama-2-ipex-llm-4-bit'
+  with low_memory_init(): # Fast and low cost by loading model on meta device
+      model = LlamaForCausalLM.from_pretrained(saved_dir,
+                                               torch_dtype="auto",
+                                               trust_remote_code=True)
+  model = load_low_bit(model, saved_dir) # Load the optimized model
 
-      .. code-block:: python
+  model = model.to('xpu') # Important after obtaining the optimized model
+  ```
 
-         from transformers import LlamaForCausalLM
-         from ipex_llm.optimize import low_memory_init, load_low_bit
+- For **`transformers`-style API**:
 
-         saved_dir='./llama-2-ipex-llm-4-bit'
-         with low_memory_init(): # Fast and low cost by loading model on meta device
-             model = LlamaForCausalLM.from_pretrained(saved_dir,
-                                                      torch_dtype="auto",
-                                                      trust_remote_code=True)
-         model = load_low_bit(model, saved_dir) # Load the optimized model
+  You could run any Hugging Face Transformers model with `transformers`-style API, and the loading and optimizing process on Intel GPUs may be as follows:
 
+  ```python
+  # Take Llama-2-7b-chat-hf as an example
+  from ipex_llm.transformers import AutoModelForCausalLM
 
-         model = model.to('xpu') # Important after obtaining the optimized model
+  # Load model in 4 bit, which converts the relevant layers in the model into INT4 format
+  model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', load_in_4bit=True)
 
-   .. tab:: ``transformers``-style API
+  model = model.to('xpu') # Important after obtaining the optimized model
+  ```
 
-      You could run any Hugging Face Transformers model with ``transformers``-style API, and the loading and optimizing process on Intel GPUs maybe as follows:
-
-      .. code-block:: python
+  > [!TIP]
+  > When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the `from_pretrained` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
+  >
+  > See the [API doc](https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/transformers.html) to find more information.
 
-         # Take Llama-2-7b-chat-hf as an example
-         from ipex_llm.transformers import AutoModelForCausalLM
+  In particular, if you have saved the optimized model following the steps [here](./hugging_face_format.md#save--load), the loading process on Intel GPUs may be as follows:
 
-         # Load model in 4 bit, which convert the relevant layers in the model into INT4 format
-         model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', load_in_4bit=True)
+  ```python
+  from ipex_llm.transformers import AutoModelForCausalLM
 
-         model = model.to('xpu') # Important after obtaining the optimized model
+  saved_dir='./llama-2-ipex-llm-4-bit'
+  model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model
 
-      .. tip::
-
-         When running LLMs on Intel iGPUs for Windows users, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function.
-
-         See the `API doc <../../../PythonAPI/LLM/transformers.html#hugging-face-transformers-automodel>`_ to find more information.
+  model = model.to('xpu') # Important after obtaining the optimized model
+  ```
 
-      Especially, if you have saved the optimized model following setps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs maybe as follows:
-
-      .. code-block:: python
-
-         from ipex_llm.transformers import AutoModelForCausalLM
-
-         saved_dir='./llama-2-ipex-llm-4-bit'
-         model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model
-
-         model = model.to('xpu') # Important after obtaining the optimized model
-
-      .. tip::
-
-         When running saved optimized models on Intel iGPUs for Windows users, we also recommend setting ``cpu_embedding=True`` in the ``load_low_bit`` function. 
-```
+  > [!TIP]
+  >
+  > When running saved optimized models on Intel iGPUs for Windows users, we also recommend setting `cpu_embedding=True` in the `load_low_bit` function.
 
 ## Run Optimized Model
 
@@ -109,20 +100,11 @@ with torch.inference_mode():
     output_str = tokenizer.decode(output[0], skip_special_tokens=True)
 ```
 
-```eval_rst
-.. note::
+> [!NOTE]
+> The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation.
 
-   The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation.
-```
-
-```eval_rst
-.. note::
+> [!NOTE]
+> If you are a Windows user, please also note that the **first time** **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
 
-   If you are a Windows user, please also note that for **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
-```
-
-```eval_rst
-.. seealso::
-
-   See the complete examples `here `_
-```
+> [!TIP]
+> See the complete examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU).
\ No newline at end of file
diff --git a/docs/mddocs/Overview/KeyFeatures/langchain_api.md b/docs/mddocs/Overview/KeyFeatures/langchain_api.md
index 46a7adb3..099fa8f5 100644
--- a/docs/mddocs/Overview/KeyFeatures/langchain_api.md
+++ b/docs/mddocs/Overview/KeyFeatures/langchain_api.md
@@ -18,23 +18,16 @@ doc_chain = load_qa_chain(ipex_llm, ...)
 output = doc_chain.run(...)
 ```
 
-```eval_rst
-.. seealso::
-
-   See the examples `here `_.
-```
+> [!TIP]
+> See the examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/LangChain/transformers_int4).
 
 ## Using Native INT4 Format
 
 You may also convert Hugging Face *Transformers* models into native INT4 format, and then run the converted models using the LangChain API as follows.
 
-```eval_rst
-.. note::
-
-   * Currently only llama/bloom/gptneox/starcoder model families are supported; for other models, you may use the Hugging Face ``transformers`` INT4 format as described `above <./langchain_api.html#using-hugging-face-transformers-int4-format>`_.
-
-   * You may choose the corresponding API developed for specific native models to load the converted model.
-```
+> [!NOTE]
+> - Currently only llama/bloom/gptneox/starcoder model families are supported; for other models, you may use the Hugging Face `transformers` INT4 format as described [above](./langchain_api.md#using-hugging-face-transformers-int4-format).
+> - You may choose the corresponding API developed for specific native models to load the converted model.
 
 ```python
 from ipex_llm.langchain.llms import LlamaLLM
 from ipex_llm.langchain.embeddings import LlamaEmbeddings
 from langchain.chains.question_answering import load_qa_chain
 
 embeddings = LlamaEmbeddings(model_path='/path/to/converted/model.bin')
 ipex_llm = LlamaLLM(model_path='/path/to/converted/model.bin')
 
 doc_chain = load_qa_chain(ipex_llm, ...)
 doc_chain.run(...)
 ```
@@ -50,8 +43,5 @@ doc_chain = load_qa_chain(ipex_llm, ...)
 doc_chain.run(...)
 ```
 
-```eval_rst
-.. seealso::
-
-   See the examples `here `_.
-```
+> [!TIP]
+> See the examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/LangChain/native_int4) for more information.
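+
+As a minimal usage sketch (assuming the `ipex_llm` object created above and a standard `langchain` installation; the prompt template below is only an illustration), the converted model can also be plugged into a plain `LLMChain`:
+
+```python
+from langchain.prompts import PromptTemplate
+from langchain.chains import LLMChain
+
+# A simple question-answering prompt; adjust the template to your use case
+template = "Question: {question}\n\nAnswer:"
+prompt = PromptTemplate(template=template, input_variables=["question"])
+
+# Reuse the native-format IPEX-LLM model defined above as a regular LangChain LLM
+llm_chain = LLMChain(prompt=prompt, llm=ipex_llm)
+print(llm_chain.run("What is AI?"))
+```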