diff --git a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/finetune.md b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/finetune.md
index 69c926eb..70157907 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/finetune.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/finetune.md
@@ -10,16 +10,17 @@ We also support finetuning LLMs (large language models) using QLoRA with BigDL-L
 
 To help you better understand the finetuning process, here we use model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.
 
-**Make sure you have prepared environment following instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
+**Make sure you have prepared the environment following the instructions [here](../install_gpu.html).**
 
-```python
-import intel_extension_for_pytorch as ipex
+```eval_rst
+.. note::
+
+   If you are using an older version of ``bigdl-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
 ```
 
 First, load model using `transformers`-style API and **set it to `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.
 
 ```python
-import intel_extension_for_pytorch as ipex
 from bigdl.llm.transformers import AutoModelForCausalLM
 
 model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
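For orientation, a minimal sketch of the loading step this hunk documents is shown below. It only relies on the calls visible above (`AutoModelForCausalLM.from_pretrained` with `load_in_low_bit="nf4"` and the move to `'xpu'`); the tokenizer line is an illustrative assumption rather than part of the patch.

```python
# Sketch of the QLoRA loading step described in finetune.md (illustrative, not part of the patch).
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer  # tokenizer choice is an assumption

# Load Llama-2-7b with 4-bit NormalFloat ("nf4") weights, as recommended by the QLoRA paper
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4")
model = model.to('xpu')  # move the low-bit model to the Intel GPU

tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
```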
diff --git a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/inference_on_gpu.md b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/inference_on_gpu.md
index 8ca89d8c..966b6a2e 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/inference_on_gpu.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/inference_on_gpu.md
@@ -4,10 +4,12 @@ Apart from the significant acceleration capabilites on Intel CPUs, BigDL-LLM als
 
 Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.
 
-**Make sure you have prepared environment following instructions [here](../install_gpu.html). First of all, you need to import `intel_extension_for_pytorch` to run on Intel GPUs**:
+**Make sure you have prepared the environment following the instructions [here](../install_gpu.html).**
 
-```python
-import intel_extension_for_pytorch as ipex
+```eval_rst
+.. note::
+
+   If you are using an older version of ``bigdl-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
 ```
 
 ## Load and Optimize Model
@@ -26,7 +28,6 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
       .. code-block:: python
 
          # Take Llama-2-7b-chat-hf as an example
-         import intel_extension_for_pytorch as ipex
         from transformers import LlamaForCausalLM
         from bigdl.llm import optimize_model
 
@@ -35,11 +36,16 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
 
         model = model.to('xpu') # Important after obtaining the optimized model
 
+      .. tip::
+
+         For Windows users running LLMs on Intel iGPUs, we recommend setting ``cpu_embedding=True`` in the ``optimize_model`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
+
+         See the `API doc <../../../PythonAPI/LLM/optimize.html#bigdl.llm.optimize_model>`_ for ``optimize_model`` to find more information.
+
       Especially, if you have saved the optimized model following setps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs maybe as follows:
 
       .. code-block:: python
 
-         import intel_extension_for_pytorch as ipex
         from transformers import LlamaForCausalLM
         from bigdl.llm.optimize import low_memory_init, load_low_bit
 
@@ -59,7 +65,6 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
       .. code-block:: python
 
         # Take Llama-2-7b-chat-hf as an example
-        import intel_extension_for_pytorch as ipex
        from bigdl.llm.transformers import AutoModelForCausalLM
 
        # Load model in 4 bit, which convert the relevant layers in the model into INT4 format
@@ -67,17 +72,26 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
       model = model.to('xpu') # Important after obtaining the optimized model
 
+      .. tip::
+
+         For Windows users running LLMs on Intel iGPUs, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
+
+         See the `API doc <../../../PythonAPI/LLM/transformers.html#hugging-face-transformers-automodel>`_ to find more information.
+
       Especially, if you have saved the optimized model following setps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs maybe as follows:
 
       .. code-block:: python
 
-         import intel_extension_for_pytorch as ipex
         from bigdl.llm.transformers import AutoModelForCausalLM
 
        saved_dir='./llama-2-bigdl-llm-4-bit'
        model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model
        model = model.to('xpu') # Important after obtaining the optimized model
+
+      .. tip::
+
+         For Windows users running saved optimized models on Intel iGPUs, we also recommend setting ``cpu_embedding=True`` in the ``load_low_bit`` function.
 ```
 
 ## Run Optimized Model
@@ -101,6 +115,11 @@ with torch.inference_mode():
 
 The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation.
 ```
 
+```eval_rst
+.. note::
+
+   If you are a Windows user, please also note that for **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
+```
 
 ```eval_rst
 .. seealso::
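Putting the pieces of this file together, a hedged end-to-end sketch of the load / warm-up / generate flow is shown below; the prompt text and generation arguments are illustrative assumptions rather than part of the patch.

```python
# Illustrative GPU inference flow based on inference_on_gpu.md (not part of the patch).
import torch
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer  # tokenizer choice is an assumption

# Load in 4 bit; cpu_embedding=True is the recommendation for Windows iGPU users
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_4bit=True,
                                             cpu_embedding=True)
model = model.to('xpu')  # important after obtaining the optimized model
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu')
with torch.inference_mode():
    model.generate(input_ids, max_new_tokens=32)           # warm-up run: the first generation can be slow
    output = model.generate(input_ids, max_new_tokens=32)  # actual generation
print(tokenizer.decode(output[0], skip_special_tokens=True))
```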
diff --git a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.md b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.md
index 3eb48769..88aee516 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.md
@@ -5,10 +5,26 @@ In [Inference on GPU](inference_on_gpu.md) and [Finetune (QLoRA)](finetune.md),
 
 ## List devices
 The `sycl-ls` tool enumerates a list of devices available in the system. You can use it after you setup oneapi environment:
-```bash
-source /opt/intel/oneapi/setvars.sh
-sycl-ls
+
+```eval_rst
+.. tabs::
+   .. tab:: Windows
+
+      Please make sure you are using CMD (Anaconda Prompt if using conda):
+
+      .. code-block:: cmd
+
+         call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+         sycl-ls
+
+   .. tab:: Linux
+
+      .. code-block:: bash
+
+         source /opt/intel/oneapi/setvars.sh
+         sycl-ls
 ```
+
 If you have two Arc770 GPUs, you can get something like below:
 ```
 [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
@@ -20,7 +36,7 @@ If you have two Arc770 GPUs, you can get something like below:
 [ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
 [ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.26241]
 ```
-This output shows there are two Arc A770 GPUs on this machine.
+This output shows there are two Arc A770 GPUs as well as an Intel iGPU on this machine.
 
 ## Devices selection
 To enable xpu, you should convert your model and input to xpu by below code:
 ```
 model = model.to('xpu')
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 ```
 To select the desired devices, there are two ways: one is changing the code, another is adding an environment variable. See:
 ### 1. Select device in python
-To specify a xpu, you can change the `to('xpu')` to `to('xpu:[device_id]')`, this device_id is counted from zero.
+To specify an xpu, you can change the `to('xpu')` to `to('xpu:[device_id]')`; the device_id is counted from zero.
+
 If you you want to use the second device, you can change the code like this:
 ```
 model = model.to('xpu:1')
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu:1')
 ```
 
 ### 2. OneAPI device selector
 Device selection environment variable, `ONEAPI_DEVICE_SELECTOR`, can be used to limit the choice of Intel GPU devices. As upon `sycl-ls` shows, the last three lines are three Level Zero GPU devices. So we can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select devices. For example, you want to use the second A770 GPU, you can run the python like this:
-```
-ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
-```
-`ONEAPI_DEVICE_SELECTOR=level_zero:1` in upon command only affect in current python program. Also, you can export the environment, then run your python:
-```
-export ONEAPI_DEVICE_SELECTOR=level_zero:1
-python generate.py
+
+```eval_rst
+.. tabs::
+   .. tab:: Windows
+
+      .. code-block:: cmd
+
+         set ONEAPI_DEVICE_SELECTOR=level_zero:1
+         python generate.py
+
+      Through ``set ONEAPI_DEVICE_SELECTOR=level_zero:1``, only the second A770 GPU will be available to the current environment.
+
+   .. tab:: Linux
+
+      .. code-block:: bash
+
+         ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
+
+      ``ONEAPI_DEVICE_SELECTOR=level_zero:1`` in the above command only affects the current Python program. Alternatively, you can export the environment variable and then run your Python script:
+
+      .. code-block:: bash
+
+         export ONEAPI_DEVICE_SELECTOR=level_zero:1
+         python generate.py
+
 ```
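As a quick sanity check before pinning a device, you can also list what PyTorch itself sees. This is a sketch that assumes the `torch.xpu` device API exposed by Intel Extension for PyTorch; it is not part of the patch.

```python
# List XPU devices visible to PyTorch (assumes an IPEX XPU build).
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device backend

if torch.xpu.is_available():
    for device_id in range(torch.xpu.device_count()):
        print(f"xpu:{device_id} -> {torch.xpu.get_device_name(device_id)}")
else:
    print("No XPU device detected; check the oneAPI environment setup described above.")
```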
diff --git a/docs/readthedocs/source/doc/LLM/Overview/install_gpu.md b/docs/readthedocs/source/doc/LLM/Overview/install_gpu.md
index 1febe099..47b7a187 100644
--- a/docs/readthedocs/source/doc/LLM/Overview/install_gpu.md
+++ b/docs/readthedocs/source/doc/LLM/Overview/install_gpu.md
@@ -61,32 +61,74 @@ pip install intel_extension_for_pytorch-2.1.10+xpu-cp39-cp39-win_amd64.whl
 pip install --pre --upgrade bigdl-llm[xpu]
 ```
 
+```eval_rst
+.. note::
+
+   All the wheel packages mentioned here are for Python 3.9. If you would like to use Python 3.10 or 3.11, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp39`` with ``cp310`` or ``cp311``, respectively.
+```
+
 ### Runtime Configuration
 
 To use GPU acceleration on Windows, several environment variables are required before running a GPU example.
 
-Make sure you are using CMD as PowerShell is not supported:
-```
+Make sure you are using CMD (Anaconda Prompt if using conda), as PowerShell is not supported:
+
+```cmd
 call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
 ```
 
-Please also set the following environment variable for iGPU:
+Please also set the following environment variables according to the device you would like to run LLMs on:
 
-```
-set SYCL_CACHE_PERSISTENT=1
-set BIGDL_LLM_XMX_DISABLED=1
+```eval_rst
+.. tabs::
+   .. tab:: Intel iGPU
+
+      .. code-block:: cmd
+
+         set SYCL_CACHE_PERSISTENT=1
+         set BIGDL_LLM_XMX_DISABLED=1
+
+   .. tab:: Intel Arc™ A300-Series or Pro A60
+
+      .. code-block:: cmd
+
+         set SYCL_CACHE_PERSISTENT=1
+
+   .. tab:: Other Intel dGPU Series
+
+      There is no need to set further environment variables.
 ```
 
 ```eval_rst
 .. note::
 
-   For the first time that **each model** runs on **iGPU**, it may take around several minutes to compile.
+   For **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
 ```
 
-
-
+### Troubleshooting
+
+#### 1. Error loading `intel_extension_for_pytorch`
+
+If you meet an error when importing `intel_extension_for_pytorch`, please ensure that you have completed the following steps:
+
+* Ensure that you have installed Visual Studio with "Desktop development with C++" workload.
+
+* Make sure that the correct version of oneAPI, specifically 2024.0, is installed.
+
+* Ensure that `libuv` is installed in your conda environment. This can be done during the creation of the environment with the command:
+  ```cmd
+  conda create -n llm python=3.9 libuv
+  ```
+  If you missed `libuv`, you can add it to your existing environment with:
+  ```cmd
+  conda install libuv
+  ```
+
+* Make sure you have configured oneAPI environment variables in your command prompt by running:
+  ```cmd
+  call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
+  ```
+  Please note that you need to set these environment variables again in each new command prompt window.
 
 ## Linux
 
@@ -131,7 +173,7 @@ BigDL-LLM for GPU supports on Linux has been verified on:
 
       We recommend you to use `this offline package `_ to install oneapi.
 
-   .. tab:: Pytorch 2.0
+   .. tab:: PyTorch 2.0
 
       To enable BigDL-LLM for Intel GPUs with PyTorch 2.0, here're several prerequisite steps for tools installation and environment preparation:
 
@@ -164,7 +206,7 @@ We recommend using [miniconda](https://docs.conda.io/en/latest/miniconda.html) t
 
 ```eval_rst
 .. tabs::
-   .. tab:: Pytorch 2.1
+   .. tab:: PyTorch 2.1
 
       .. code-block:: bash
 
@@ -182,7 +224,7 @@ We recommend using [miniconda](https://docs.conda.io/en/latest/miniconda.html) t
 
          pip install --pre --upgrade bigdl-llm[xpu_2.1] -f https://developer.intel.com/ipex-whl-stable-xpu
 
-   .. tab:: Pytorch 2.0
+   .. tab:: PyTorch 2.0
 
       .. code-block:: bash
 
@@ -243,6 +285,12 @@ If you encounter network issues when installing IPEX, you can also install BigDL
 
 ```
 
+```eval_rst
+.. note::
+
+   All the wheel packages mentioned here are for Python 3.9. If you would like to use Python 3.10 or 3.11, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp39`` with ``cp310`` or ``cp311``, respectively.
+```
+
 ### Runtime Configuration
 
 To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example.
@@ -255,7 +303,7 @@ To use GPU acceleration on Linux, several environment variables are required or
 
       .. code-block:: bash
 
-         # Required step. Configure OneAPI environment variables
+         # Required step. Configure oneAPI environment variables
          source /opt/intel/oneapi/setvars.sh
 
          # Recommended Environment Variables
@@ -268,7 +316,7 @@ To use GPU acceleration on Linux, several environment variables are required or
 
       .. code-block:: bash
 
-         # Required step. Configure OneAPI environment variables
+         # Required step. Configure oneAPI environment variables
          source /opt/intel/oneapi/setvars.sh
 
          # Recommended Environment Variables
@@ -316,4 +364,4 @@ Error: libmkl_sycl_blas.so.4: cannot open shared object file: No such file or di
 The reason for such errors is that oneAPI has not been initialized properly before running BigDL-LLM code or before importing IPEX package.
 
 * Step 1: Make sure you execute setvars.sh of oneAPI Base Toolkit before running BigDL-LLM code.
-* Step 2: Make sure you install matching versions of BigDL-LLM/pytorch/IPEX and oneAPI Base Toolkit. BigDL-LLM with Pytorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. BigDL-LLM with Pytorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
\ No newline at end of file
+* Step 2: Make sure you install matching versions of BigDL-LLM/PyTorch/IPEX and oneAPI Base Toolkit. BigDL-LLM with PyTorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. BigDL-LLM with PyTorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
diff --git a/python/llm/src/bigdl/llm/optimize.py b/python/llm/src/bigdl/llm/optimize.py
index 71fce857..75b1760b 100644
--- a/python/llm/src/bigdl/llm/optimize.py
+++ b/python/llm/src/bigdl/llm/optimize.py
@@ -199,15 +199,19 @@ def optimize_model(model, low_bit='sym_int4', optimize_llm=True, modules_to_not_
     A method to optimize any pytorch model.
 
     :param model: The original PyTorch model (nn.module)
-    :param low_bit: Supported low-bit options are "sym_int4", "asym_int4", "sym_int5",
-                    "asym_int5" or "sym_int8".
-    :param optimize_llm: Whether to further optimize llm model.
+    :param low_bit: str value, options are ``'sym_int4'``, ``'asym_int4'``, ``'sym_int5'``,
+                    ``'asym_int5'``, ``'sym_int8'``, ``'nf3'``, ``'nf4'``, ``'fp4'``,
+                    ``'fp8'``, ``'fp8_e4m3'``, ``'fp8_e5m2'``, ``'fp16'`` or ``'bf16'``,
+                    ``'sym_int4'`` means symmetric int 4, ``'asym_int4'`` means
+                    asymmetric int 4, ``'nf4'`` means 4-bit NormalFloat, etc.
+                    Relevant low bit optimizations will be applied to the model.
+    :param optimize_llm: Whether to further optimize the LLM model. Default to be ``True``.
     :param modules_to_not_convert: list of str value, modules (nn.Module) that are skipped
-                                   when conducting model optimizations. Default to be None.
+                                   when conducting model optimizations. Default to be ``None``.
     :param cpu_embedding: Whether to replace the Embedding layer, may need to set it
-                          to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+                          to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
     :param lightweight_bmm: Whether to replace the torch.bmm ops, may need to set it
-                            to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+                            to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
 
     :return: The optimized model.
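For readers skimming the docstring change above, here is a short usage sketch that exercises the documented parameters of `optimize_model`; the model choice and the `'nf4'` option are illustrative, not part of the patch.

```python
# Usage sketch for optimize_model's documented parameters (illustrative only).
from transformers import LlamaForCausalLM
from bigdl.llm import optimize_model

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# 'nf4' is one of the low_bit options listed above; cpu_embedding=True is the
# recommendation for running BigDL-LLM on GPU on Windows.
model = optimize_model(model, low_bit='nf4', optimize_llm=True, cpu_embedding=True)
model = model.to('xpu')
```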
diff --git a/python/llm/src/bigdl/llm/transformers/model.py b/python/llm/src/bigdl/llm/transformers/model.py
index 1cf21936..f9af3faa 100644
--- a/python/llm/src/bigdl/llm/transformers/model.py
+++ b/python/llm/src/bigdl/llm/transformers/model.py
@@ -101,22 +101,24 @@ class _BaseAutoModelClass:
         Three new arguments are added to extend Hugging Face's from_pretrained method as follows:
 
         :param load_in_4bit: boolean value, True means loading linear's weight to symmetric int 4 if
-                              the model is a regular fp16/bf16/fp32 model, and to asymmetric int 4
-                              if the model is GPTQ model.
-                              Default to be False.
-        :param load_in_low_bit: str value, options are sym_int4, asym_int4, sym_int5, asym_int5
-                                , sym_int8, nf3, nf4, fp4, fp8, fp8_e4m3, fp8_e5m2, fp16 or bf16.
-                                sym_int4 means symmetric int 4, asym_int4 means asymmetric int 4,
-                                nf4 means 4-bit NormalFloat, etc. Relevant low bit optimizations
-                                will be applied to the model.
+                             the model is a regular fp16/bf16/fp32 model, and to asymmetric int 4
+                             if the model is GPTQ model.
+                             Default to be ``False``.
+        :param load_in_low_bit: str value, options are ``'sym_int4'``, ``'asym_int4'``,
+                                ``'sym_int5'``, ``'asym_int5'``, ``'sym_int8'``, ``'nf3'``,
+                                ``'nf4'``, ``'fp4'``, ``'fp8'``, ``'fp8_e4m3'``, ``'fp8_e5m2'``,
+                                ``'fp16'`` or ``'bf16'``, ``'sym_int4'`` means symmetric int 4,
+                                ``'asym_int4'`` means asymmetric int 4, ``'nf4'`` means 4-bit
+                                NormalFloat, etc. Relevant low bit optimizations will be applied
+                                to the model.
         :param optimize_model: boolean value, Whether to further optimize the low_bit llm model.
-                               Default to be True.
+                               Default to be ``True``.
         :param modules_to_not_convert: list of str value, modules (nn.Module) that are skipped when
-                                       conducting model optimizations. Default to be None.
+                                       conducting model optimizations. Default to be ``None``.
         :param cpu_embedding: Whether to replace the Embedding layer, may need to set it
-                              to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+                              to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
         :param lightweight_bmm: Whether to replace the torch.bmm ops, may need to set it
-                                to `True` when running BigDL-LLM on GPU on Windows. Default to be `False`.
+                                to ``True`` when running BigDL-LLM on GPU on Windows. Default to be ``False``.
 
         :return: a model instance
         """
         pretrained_model_name_or_path = kwargs.get("pretrained_model_name_or_path", None) \
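Similarly, a short usage sketch for the extended `from_pretrained` arguments documented above; the checkpoint name and the chosen precision are illustrative, not part of the patch.

```python
# Usage sketch for the extended from_pretrained arguments (illustrative only).
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    load_in_low_bit="sym_int4",  # one of the documented low-bit options
    optimize_model=True,
    cpu_embedding=True,          # recommended when running BigDL-LLM on GPU on Windows
)
model = model.to('xpu')
```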