diff --git a/docs/mddocs/DockerGuides/index.md b/docs/mddocs/DockerGuides/README.md
similarity index 100%
rename from docs/mddocs/DockerGuides/index.md
rename to docs/mddocs/DockerGuides/README.md
diff --git a/docs/mddocs/Overview/KeyFeatures/index.md b/docs/mddocs/Overview/KeyFeatures/README.md
similarity index 100%
rename from docs/mddocs/Overview/KeyFeatures/index.md
rename to docs/mddocs/Overview/KeyFeatures/README.md
diff --git a/docs/mddocs/Overview/KeyFeatures/inference_on_gpu.md b/docs/mddocs/Overview/KeyFeatures/inference_on_gpu.md
index 6684b5bf..e152c228 100644
--- a/docs/mddocs/Overview/KeyFeatures/inference_on_gpu.md
+++ b/docs/mddocs/Overview/KeyFeatures/inference_on_gpu.md
@@ -34,7 +34,7 @@ You could choose to use [PyTorch API](./optimize_model.md) or [`transformers`-st
 >
 > When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the `optimize_model` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
 >
-> See the [API doc](https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/optimize.html) for ``optimize_model`` to find more information.
+> See the [API doc](../../PythonAPI/optimize.md) for ``optimize_model`` to find more information.
 
 Especially, if you have saved the optimized model following setps [here](./optimize_model.md#save), the loading process on Intel GPUs maybe as follows:
 
@@ -70,7 +70,7 @@ You could choose to use [PyTorch API](./optimize_model.md) or [`transformers`-st
 >
 > When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the `from_pretrained` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
 >
-> See the [API doc](https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/transformers.html) to find more information.
+> See the [API doc](../../PythonAPI/transformers.md) to find more information.
 
 Especially, if you have saved the optimized model following setps [here](./hugging_face_format.md#save--load), the loading process on Intel GPUs maybe as follows:
 
diff --git a/docs/mddocs/Overview/KeyFeatures/optimize_model.md b/docs/mddocs/Overview/KeyFeatures/optimize_model.md
index d3098c64..885002da 100644
--- a/docs/mddocs/Overview/KeyFeatures/optimize_model.md
+++ b/docs/mddocs/Overview/KeyFeatures/optimize_model.md
@@ -61,6 +61,6 @@ model = load_low_bit(model, saved_dir) # Load the optimized model
 
 > [!NOTE]
-> - Please refer to the [API documentation](https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/optimize.html) for more details.
+> - Please refer to the [API documentation](../../PythonAPI/optimize.md) for more details.
 > - We also provide detailed examples on how to run PyTorch models (e.g., Openai Whisper, LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using IPEX-LLM. See the complete CPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models) and GPU examples [here](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models)
diff --git a/docs/mddocs/PythonAPI/PyTorch-API.md b/docs/mddocs/PythonAPI/PyTorch-API.md
deleted file mode 100644
index 60d39897..00000000
--- a/docs/mddocs/PythonAPI/PyTorch-API.md
+++ /dev/null
@@ -1,85 +0,0 @@
-# IPEX-LLM PyTorch API
-
-## Optimize Model
-You can run any PyTorch model with `optimize_model` through only one-line code change to benefit from IPEX-LLM optimization, regardless of the library or API you are using.
-
-### `ipex_llm.optimize_model`_`(model, low_bit='sym_int4', optimize_llm=True, modules_to_not_convert=None, cpu_embedding=False, lightweight_bmm=False, **kwargs)`_
-
-A method to optimize any pytorch model.
-
-- **Parameters**:
-
-  - **model**: The original PyTorch model (nn.module)
-
-  - **low_bit**: str value, options are `'sym_int4'`, `'asym_int4'`, `'sym_int5'`, `'asym_int5'`, `'sym_int8'`, `'nf3'`, `'nf4'`, `'fp4'`, `'fp8'`, `'fp8_e4m3'`, `'fp8_e5m2'`, `'fp16'` or `'bf16'`, `'sym_int4'` means symmetric int 4, `'asym_int4'` means asymmetric int 4, `'nf4'` means 4-bit NormalFloat, etc. Relevant low bit optimizations will be applied to the model.
-
-  - **optimize_llm**: Whether to further optimize llm model.
-
-    Default to be `True`.
-
-  - **modules_to_not_convert**: list of str value, modules (`nn.Module`) that are skipped when conducting model optimizations.
-
-    Default to be `None`.
-
-  - **cpu_embedding**: Whether to replace the Embedding layer, may need to set it to `True` when running BigDL-LLM on GPU on Windows.
-
-    Default to be `False`.
-
-  - **lightweight_bmm**: Whether to replace the `torch.bmm` ops, may need to set it to `True` when running BigDL-LLM on GPU on Windows.
-
-    Default to be `False`.
-
-- **Returns**: The optimized model.
-
-- **Example**:
-  ```python
-  # Take OpenAI Whisper model as an example
-  from ipex_llm import optimize_model
-  model = whisper.load_model('tiny') # Load whisper model under pytorch framework
-  model = optimize_model(model) # With only one line code change
-  # Use the optimized model without other API change
-  result = model.transcribe(audio, verbose=True, language="English")
-  # (Optional) you can also save the optimized model by calling 'save_low_bit'
-  model.save_low_bit(saved_dir)
-  ```
-
-## Load Optimized Model
-
-To avoid high resource consumption during the loading processes of the original model, we provide save/load API to support the saving of model after low-bit optimization and the loading of the saved low-bit model. Saving and loading operations are platform-independent, regardless of their operating systems.
-
-### `ipex_llm.optimize.load_low_bit`_`(model, model_path)`_
-
-Load the optimized pytorch model.
-
-- **Parameters**:
-
-  - **model**: The PyTorch model instance.
-
-  - **model_path**: The path of saved optimized model.
-
-
-- **Returns**: The optimized model.
-
-- **Example**:
-  ```python
-  # Example 1:
-  # Take ChatGLM2-6B model as an example
-  # Make sure you have saved the optimized model by calling 'save_low_bit'
-  from ipex_llm.optimize import low_memory_init, load_low_bit
-  with low_memory_init(): # Fast and low cost by loading model on meta device
-      model = AutoModel.from_pretrained(saved_dir,
-                                        torch_dtype="auto",
-                                        trust_remote_code=True)
-  model = load_low_bit(model, saved_dir) # Load the optimized model
-  ```
-
-  ```python
-  # Example 2:
-  # If the model doesn't fit 'low_memory_init' method,
-  # alternatively, you can obtain the model instance through traditional loading method.
-  # Take OpenAI Whisper model as an example
-  # Make sure you have saved the optimized model by calling 'save_low_bit'
-  from ipex_llm.optimize import load_low_bit
-  model = whisper.load_model('tiny') # A model instance through traditional loading method
-  model = load_low_bit(model, saved_dir) # Load the optimized model
-  ```
\ No newline at end of file
diff --git a/docs/mddocs/PythonAPI/README.md b/docs/mddocs/PythonAPI/README.md
new file mode 100644
index 00000000..b83c0bbd
--- /dev/null
+++ b/docs/mddocs/PythonAPI/README.md
@@ -0,0 +1,22 @@
+# IPEX-LLM API
+
+- [IPEX-LLM `transformers`-style API](./transformers.md)
+
+  - [Hugging Face `transformers` AutoModel](./transformers.md#hugging-face-transformers-automodel)
+
+    - AutoModelForCausalLM
+    - AutoModel
+    - AutoModelForSpeechSeq2Seq
+    - AutoModelForSeq2SeqLM
+    - AutoModelForSequenceClassification
+    - AutoModelForMaskedLM
+    - AutoModelForQuestionAnswering
+    - AutoModelForNextSentencePrediction
+    - AutoModelForMultipleChoice
+    - AutoModelForTokenClassification
+
+- [IPEX-LLM PyTorch API](./optimize.md)
+
+  - [Optimize Model](./optimize.md#optimize-model)
+
+  - [Load Optimized Model](./optimize.md#load-optimized-model)
\ No newline at end of file
diff --git a/docs/mddocs/PythonAPI/optimize.md b/docs/mddocs/PythonAPI/optimize.md
new file mode 100644
index 00000000..993f6e82
--- /dev/null
+++ b/docs/mddocs/PythonAPI/optimize.md
@@ -0,0 +1,79 @@
+# IPEX-LLM PyTorch API
+
+## Optimize Model
+You can run any PyTorch model with `optimize_model` through only one-line code change to benefit from IPEX-LLM optimization, regardless of the library or API you are using.
+
+### `ipex_llm.optimize_model`_`(model, low_bit='sym_int4', optimize_llm=True, modules_to_not_convert=None, cpu_embedding=False, lightweight_bmm=False, **kwargs)`_
+
+A method to optimize any pytorch model.
+
+- **Parameters**:
+
+  - **model**: The original PyTorch model (nn.module)
+
+  - **low_bit**: str value, options are `'sym_int4'`, `'asym_int4'`, `'sym_int5'`, `'asym_int5'`, `'sym_int8'`, `'nf3'`, `'nf4'`, `'fp4'`, `'fp8'`, `'fp8_e4m3'`, `'fp8_e5m2'`, `'fp16'` or `'bf16'`, `'sym_int4'` means symmetric int 4, `'asym_int4'` means asymmetric int 4, `'nf4'` means 4-bit NormalFloat, etc. Relevant low bit optimizations will be applied to the model.
+
+  - **optimize_llm**: Whether to further optimize llm model. Default to be `True`.
+
+  - **modules_to_not_convert**: list of str value, modules (`nn.Module`) that are skipped when conducting model optimizations. Default to be `None`.
+
+  - **cpu_embedding**: Whether to replace the Embedding layer, may need to set it to `True` when running IPEX-LLM on GPU. Default to be `False`.
+
+  - **lightweight_bmm**: Whether to replace the `torch.bmm` ops, may need to set it to `True` when running IPEX-LLM on GPU on Windows. Default to be `False`.
+
+- **Returns**: The optimized model.
+
+- **Example**:
+
+  ```python
+  # Take OpenAI Whisper model as an example
+  from ipex_llm import optimize_model
+  model = whisper.load_model('tiny') # Load whisper model under pytorch framework
+  model = optimize_model(model) # With only one line code change
+  # Use the optimized model without other API change
+  result = model.transcribe(audio, verbose=True, language="English")
+  # (Optional) you can also save the optimized model by calling 'save_low_bit'
+  model.save_low_bit(saved_dir)
+  ```
+
+## Load Optimized Model
+
+To avoid high resource consumption during the loading processes of the original model, we provide save/load API to support the saving of model after low-bit optimization and the loading of the saved low-bit model. Saving and loading operations are platform-independent, regardless of their operating systems.
+
+### `ipex_llm.optimize.load_low_bit`_`(model, model_path)`_
+
+Load the optimized pytorch model.
+
+- **Parameters**:
+
+  - **model**: The PyTorch model instance.
+
+  - **model_path**: The path of saved optimized model.
+
+
+- **Returns**: The optimized model.
+
+- **Example**:
+
+  ```python
+  # Example 1:
+  # Take ChatGLM2-6B model as an example
+  # Make sure you have saved the optimized model by calling 'save_low_bit'
+  from ipex_llm.optimize import low_memory_init, load_low_bit
+  with low_memory_init(): # Fast and low cost by loading model on meta device
+      model = AutoModel.from_pretrained(saved_dir,
+                                        torch_dtype="auto",
+                                        trust_remote_code=True)
+  model = load_low_bit(model, saved_dir) # Load the optimized model
+  ```
+
+  ```python
+  # Example 2:
+  # If the model doesn't fit 'low_memory_init' method,
+  # alternatively, you can obtain the model instance through traditional loading method.
+  # Take OpenAI Whisper model as an example
+  # Make sure you have saved the optimized model by calling 'save_low_bit'
+  from ipex_llm.optimize import load_low_bit
+  model = whisper.load_model('tiny') # A model instance through traditional loading method
+  model = load_low_bit(model, saved_dir) # Load the optimized model
+  ```
\ No newline at end of file
diff --git a/docs/mddocs/Quickstart/index.md b/docs/mddocs/Quickstart/README.md
similarity index 100%
rename from docs/mddocs/Quickstart/index.md
rename to docs/mddocs/Quickstart/README.md
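For context, the new `docs/mddocs/PythonAPI/README.md` above also links to the `transformers`-style API (`transformers.md`), which this diff does not show. A minimal illustrative sketch of that loading flow is given below; it is not part of this diff, `model_path` and the save directory are placeholders, and the exact keyword arguments should be checked against `transformers.md`:

```python
# Illustrative sketch of the IPEX-LLM `transformers`-style API referenced in
# docs/mddocs/PythonAPI/README.md (not part of this diff).
# Assumptions: ipex-llm and transformers are installed; `model_path` is a
# placeholder for a Hugging Face-format checkpoint on disk or a model id.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/your/model"  # placeholder

# Load the model with low-bit (INT4) optimization applied during loading
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# (Optional) persist the low-bit weights so later loads skip the original checkpoint
model.save_low_bit("path/to/low_bit_model")  # placeholder directory
model = AutoModelForCausalLM.load_low_bit("path/to/low_bit_model",
                                          trust_remote_code=True)
```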