Add initial md docs (#11371)

This commit is contained in:
Yuwen Hu 2024-06-20 13:47:49 +08:00 committed by GitHub
parent 9601fae5d5
commit 769728c1eb
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
47 changed files with 6406 additions and 0 deletions

View file

@@ -0,0 +1,221 @@
# Run llama.cpp/Ollama/Open-WebUI on an Intel GPU via Docker
## Quick Start
### Install Docker
1. **Linux**: Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux.
2. **Windows**: Refer to this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#install-docker-desktop-for-windows).
#### Docker settings on Windows
You need to enable `--net=host`; follow [this guide](https://docs.docker.com/network/drivers/host/#docker-desktop) so that you can easily access the services running inside the Docker container. The [WSL kernel v6.1.x](https://learn.microsoft.com/en-us/community/content/wsl-user-msft-kernel-v6#1---building-the-microsoft-linux-kernel-v61x) is recommended; otherwise, you may encounter a blocking issue before the model is loaded onto the GPU.
### Pull the latest image
```bash
# This image will be updated every day
docker pull intelanalytics/ipex-llm-inference-cpp-xpu:latest
```
### Start Docker Container
```eval_rst
.. tabs::
.. tab:: Linux
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Set `DEVICE` to the device type you are running on (Max, Flex, Arc or iGPU), and change `/path/to/models` to the directory where your models are stored. `bench_model` is used for a quick benchmark; if you want to benchmark, make sure the model file is under `/path/to/models`.
.. code-block:: bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
-v /path/to/models:/models \
-e no_proxy=localhost,127.0.0.1 \
--memory="32G" \
--name=$CONTAINER_NAME \
-e bench_model="mistral-7b-v0.1.Q4_0.gguf" \
-e DEVICE=Arc \
--shm-size="16g" \
$DOCKER_IMAGE
.. tab:: Windows
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container, and change `/path/to/models` to the directory where your models are stored. On Windows (WSL), you also need to add `--privileged` and map `/usr/lib/wsl` into the container.
.. code-block:: bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
export CONTAINER_NAME=ipex-llm-inference-cpp-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
--privileged \
-v /path/to/models:/models \
-v /usr/lib/wsl:/usr/lib/wsl \
-e no_proxy=localhost,127.0.0.1 \
--memory="32G" \
--name=$CONTAINER_NAME \
-e bench_model="mistral-7b-v0.1.Q4_0.gguf" \
-e DEVICE=Arc \
--shm-size="16g" \
$DOCKER_IMAGE
```
After the container is booted, you could get into the container through `docker exec`.
```bash
docker exec -it ipex-llm-inference-cpp-xpu-container /bin/bash
```
To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. In a machine with Arc A770, the sampled output is:
```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
### Quick benchmark for llama.cpp
Note that performance in a Windows WSL Docker container is slightly lower than on the Windows host; this is caused by the WSL kernel implementation.
```bash
bash /llm/scripts/benchmark_llama-cpp.sh
```
The benchmark runs three times to warm up and obtain accurate results; the example output looks like:
```bash
llama_print_timings: load time = xxx ms
llama_print_timings: sample time = xxx ms / 128 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: prompt eval time = xxx ms / xxx tokens ( xxx ms per token, xxx tokens per second)
llama_print_timings: eval time = xxx ms / 127 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: total time = xxx ms / xxx tokens
```
### Running llama.cpp inference with IPEX-LLM on Intel GPU
```bash
cd /llm/scripts/
# set the recommended Env
source ipex-llm-init --gpu --device $DEVICE
# mount models and change the model_path in `start-llama-cpp.sh`
bash start-llama-cpp.sh
```
The example output looks like:
```bash
llama_print_timings: load time = xxx ms
llama_print_timings: sample time = xxx ms / 32 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: prompt eval time = xxx ms / xxx tokens ( xxx ms per token, xxx tokens per second)
llama_print_timings: eval time = xxx ms / 31 runs ( xxx ms per token, xxx tokens per second)
llama_print_timings: total time = xxx ms / xxx tokens
```
Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html) for more details.
### Running Ollama serving with IPEX-LLM on Intel GPU
Run Ollama in the background; you can check its log at `/root/ollama/ollama.log`.
```bash
cd /llm/scripts/
# set the recommended Env
source ipex-llm-init --gpu --device $DEVICE
bash start-ollama.sh # press Ctrl+C to exit; the Ollama service will keep running in the background
```
Sample output:
```bash
time=2024-05-16T10:45:33.536+08:00 level=INFO source=images.go:697 msg="total blobs: 0"
time=2024-05-16T10:45:33.536+08:00 level=INFO source=images.go:704 msg="total unused blobs removed: 0"
time=2024-05-16T10:45:33.536+08:00 level=INFO source=routes.go:1044 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-05-16T10:45:33.537+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama751325299/runners
time=2024-05-16T10:45:33.565+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-05-16T10:45:33.565+08:00 level=INFO source=gpu.go:122 msg="Detecting GPUs"
time=2024-05-16T10:45:33.566+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
```
#### Run Ollama models (interactive)
```bash
cd /llm/ollama
# create a file named Modelfile with the following content
cat <<'EOF' > Modelfile
FROM /models/mistral-7b-v0.1.Q4_0.gguf
TEMPLATE [INST] {{ .Prompt }} [/INST]
PARAMETER num_predict 64
EOF
# create the example model and run it in the console
./ollama create example -f Modelfile
./ollama run example
```
An example interactive session with `ollama run example` looks like the following:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" width=100%; />
</a>
#### Pull models from ollama to serve
```bash
cd /llm/ollama
./ollama pull llama2
```
Use `curl` to test:
```bash
curl http://localhost:11434/api/generate -d '
{
"model": "llama2",
"prompt": "What is AI?",
"stream": false
}'
```
Sample output:
```bash
{"model":"llama2","created_at":"2024-05-16T02:52:18.972296097Z","response":"\nArtificial intelligence (AI) refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making. AI systems use algorithms and data to mimic human behavior and perform tasks such as:\n\n1. Image recognition: AI can identify objects in images and classify them into different categories.\n2. Natural Language Processing (NLP): AI can understand and generate human language, allowing it to interact with humans through voice assistants or chatbots.\n3. Predictive analytics: AI can analyze data to make predictions about future events, such as stock prices or weather patterns.\n4. Robotics: AI can control robots that perform tasks such as assembly, maintenance, and logistics.\n5. Recommendation systems: AI can suggest products or services based on a user's past behavior or preferences.\n6. Autonomous vehicles: AI can control self-driving cars that can navigate through roads and traffic without human intervention.\n7. Fraud detection: AI can identify and flag fraudulent transactions, such as credit card purchases or insurance claims.\n8. Personalized medicine: AI can analyze genetic data to provide personalized medical recommendations, such as drug dosages or treatment plans.\n9. Virtual assistants: AI can interact with users through voice or text interfaces, providing information or completing tasks.\n10. Sentiment analysis: AI can analyze text or speech to determine the sentiment or emotional tone of a message.\n\nThese are just a few examples of what AI can do. As the technology continues to evolve, we can expect to see even more innovative applications of AI in various industries and aspects of our lives.","done":true,"context":[xxx,xxx],"total_duration":12831317190,"load_duration":6453932096,"prompt_eval_count":25,"prompt_eval_duration":254970000,"eval_count":390,"eval_duration":6079077000}
```
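You can send the same request from Python as well. Below is a minimal sketch using the `requests` package (assuming it is installed where you run it), mirroring the `curl` call above:
```python
import requests

# same request as the curl example above; stream=False returns a single JSON response
payload = {
    "model": "llama2",
    "prompt": "What is AI?",
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["response"])
```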
Please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#pull-model) for more details.
### Running Open WebUI with Intel GPU
Start Ollama and load the model first, then use Open WebUI to chat.
If you have difficulty accessing the Hugging Face repositories, you may use a mirror, e.g. add `export HF_ENDPOINT=https://hf-mirror.com` before running the start script below.
```bash
cd /llm/scripts/
bash start-open-webui.sh
```
Sample output:
```bash
INFO: Started server process [1055]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
```
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_signup.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_signup.png" width="100%" />
</a>
For how to log in and other guides, please refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/open_webui_with_ollama_quickstart.html) for more details.

View file

@@ -0,0 +1,171 @@
# Python Inference using IPEX-LLM on Intel GPU
We can run PyTorch Inference Benchmark, Chat Service and PyTorch Examples on Intel GPUs within Docker (on Linux or WSL).
```eval_rst
.. note::
The current Windows + WSL + Docker solution only supports Arc series dGPU. For Windows users with MTL iGPU, it is recommended to install directly via pip install in Miniforge Prompt. Refer to `this guide <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_windows_gpu.html>`_.
```
## Install Docker
Follow the [Docker installation Guide](./docker_windows_gpu.html#install-docker) to install docker on either Linux or Windows.
## Launch Docker
Prepare ipex-llm-xpu Docker Image:
```bash
docker pull intelanalytics/ipex-llm-xpu:latest
```
Start ipex-llm-xpu Docker Container:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models  # change to your model path
docker run -itd \
--net=host \
--device=/dev/dri \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
-v $MODEL_PATH:/llm/models \
$DOCKER_IMAGE
.. tab:: Windows WSL
.. code-block:: bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models  # change to your model path
sudo docker run -itd \
--net=host \
--privileged \
--device /dev/dri \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
-v $MODEL_PATH:/llm/llm-models \
-v /usr/lib/wsl:/usr/lib/wsl \
$DOCKER_IMAGE
```
Access the container:
```
docker exec -it $CONTAINER_NAME bash
```
To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. In a machine with Arc A770, the sampled output is:
```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
```eval_rst
.. tip::
You can run the Env-Check script to verify your ipex-llm installation and runtime environment.
.. code-block:: bash
cd /ipex-llm/python/llm/scripts
bash env-check.sh
```
## Run Inference Benchmark
Navigate to the benchmark directory, and modify the `config.yaml` under the `all-in-one` folder for benchmark configurations.
```bash
cd /benchmark/all-in-one
vim config.yaml
```
In the `config.yaml`, change `repo_id` to the model you want to test and `local_model_hub` to point to your model hub path.
```yaml
...
repo_id:
- 'meta-llama/Llama-2-7b-chat-hf'
local_model_hub: '/path/to/your/model/folder'
...
```
After modifying `config.yaml`, run the following commands to run benchmarking:
```bash
source ipex-llm-init --gpu --device <value>
python run.py
```
**Result Interpretation**
After benchmarking completes, you can obtain a CSV result file in the current folder. The columns `1st token avg latency (ms)` and `2+ avg latency (ms/token)` contain the main benchmark results. You can also check whether the column `actual input/output tokens` is consistent with `input/output tokens`, and whether the parameters you specified in `config.yaml` were successfully applied in the benchmarking.
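If you prefer to inspect the results programmatically, below is a minimal sketch using `pandas` (assuming it is available in the container); the CSV file name is hypothetical, so replace it with the file actually generated under the current folder:
```python
import pandas as pd

# hypothetical file name -- use the CSV produced by run.py under the current folder
df = pd.read_csv("gpu-benchmark-results.csv")

# key latency columns, plus a sanity check on the token counts
cols = [
    "1st token avg latency (ms)",
    "2+ avg latency (ms/token)",
    "input/output tokens",
    "actual input/output tokens",
]
print(df[cols])
```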
## Run Chat Service
We provide `chat.py` for conversational AI.
For example, if your model is `Llama-2-7b-chat-hf` and mounted at `/llm/models`, you can execute the following command to start a conversation:
```bash
cd /llm
python chat.py --model-path /llm/models/Llama-2-7b-chat-hf
```
Here is a demonstration:
<a align="left" href="https://llm-assets.readthedocs.io/en/latest/_images/llm-inference-cpu-docker-chatpy-demo.gif">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/llm-inference-cpu-docker-chatpy-demo.gif" width='60%' />
</a><br>
## Run PyTorch Examples
We provide several PyTorch examples that show how to apply IPEX-LLM INT4 optimizations to models on Intel GPUs.
For example, if your model is `Llama-2-7b-chat-hf` and mounted at `/llm/models`, you can navigate to the `/examples/llama2` directory and execute the following command to run the example:
```bash
cd /examples/<model_dir>
python ./generate.py --repo-id-or-model-path /llm/models/Llama-2-7b-chat-hf --prompt PROMPT --n-predict N_PREDICT
```
Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id of the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf` or `meta-llama/Llama-2-13b-chat-hf`) to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `'meta-llama/Llama-2-7b-chat-hf'`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with the integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
**Sample Output**
```log
Inference time: xxxx s
-------------------- Prompt --------------------
<s>[INST] <<SYS>>
<</SYS>>
What is AI? [/INST]
-------------------- Output --------------------
[INST] <<SYS>>
<</SYS>>
What is AI? [/INST] Artificial intelligence (AI) is the broader field of research and development aimed at creating machines that can perform tasks that typically require human intelligence,
```

View file

@@ -0,0 +1,139 @@
# Run/Develop PyTorch in VSCode with Docker on Intel GPU
An IPEX-LLM container is a pre-configured environment that includes all necessary dependencies for running LLMs on Intel GPUs.
This guide provides steps to run/develop PyTorch examples in VSCode with Docker on Intel GPUs.
```eval_rst
.. note::
This guide assumes you have already installed VSCode in your environment.
To run/develop on Windows, install VSCode and then follow the steps below.
To run/develop on Linux, you might open VSCode first and SSH to a remote Linux machine, then proceed with the following steps.
```
## Install Docker
Follow the [Docker installation Guide](./docker_windows_gpu.html#install-docker) to install docker on either Linux or Windows.
## Install Extensions for VSCode
#### Install Dev Containers Extension
For both Linux and Windows, you will need to install the Dev Containers extension.
Open the Extensions view in VSCode (you can use the shortcut `Ctrl+Shift+X`), then search for and install the `Dev Containers` extension.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/install_dev_container_extension_in_vscode.gif" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/install_dev_container_extension_in_vscode.gif" width=100%; />
</a>
#### Install WSL Extension for Windows
For Windows, you will need to install the WSL extension to connect to the WSL environment. Open the Extensions view in VSCode (you can use the shortcut `Ctrl+Shift+X`), then search for and install the `WSL` extension.
Press F1 to bring up the Command Palette, type in `WSL: Connect to WSL Using Distro...`, select it, and then select a specific WSL distro, e.g. `Ubuntu`.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/install_wsl_extention_in_vscode.gif" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/install_wsl_extention_in_vscode.gif" width=100%; />
</a>
## Launch Container
Open the Terminal in VSCode (you can use the shortcut `` Ctrl+Shift+` ``), then pull ipex-llm-xpu Docker Image:
```bash
docker pull intelanalytics/ipex-llm-xpu:latest
```
Start ipex-llm-xpu Docker Container:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models  # change to your model path
docker run -itd \
--net=host \
--device=/dev/dri \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
-v $MODEL_PATH:/llm/models \
$DOCKER_IMAGE
.. tab:: Windows WSL
.. code-block:: bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:latest
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models  # change to your model path
sudo docker run -itd \
--net=host \
--privileged \
--device /dev/dri \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
-v $MODEL_PATH:/llm/llm-models \
-v /usr/lib/wsl:/usr/lib/wsl \
$DOCKER_IMAGE
```
## Run/Develop PyTorch Examples
Press F1 to bring up the Command Palette, type in `Dev Containers: Attach to Running Container...`, select it, and then select `my_container`.
Now you are inside a running Docker container. Open the folder `/ipex-llm/python/llm/example/GPU/HF-Transformers-AutoModels/Model/`.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/run_example_in_vscode.gif" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/run_example_in_vscode.gif" width=100%; />
</a>
In this folder, we provide several PyTorch examples that show how to apply IPEX-LLM INT4 optimizations to models on Intel GPUs.
For example, if your model is `Llama-2-7b-chat-hf` and mounted at `/llm/models`, you can navigate to the `llama2` directory and execute the following command to run the example:
```bash
cd <model_dir>
python ./generate.py --repo-id-or-model-path /llm/models/Llama-2-7b-chat-hf --prompt PROMPT --n-predict N_PREDICT
```
Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id of the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf` or `meta-llama/Llama-2-13b-chat-hf`) to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `'meta-llama/Llama-2-7b-chat-hf'`.
- `--prompt PROMPT`: argument defining the prompt to be inferred (with the integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
**Sample Output**
```log
Inference time: xxxx s
-------------------- Prompt --------------------
<s>[INST] <<SYS>>
<</SYS>>
What is AI? [/INST]
-------------------- Output --------------------
[INST] <<SYS>>
<</SYS>>
What is AI? [/INST] Artificial intelligence (AI) is the broader field of research and development aimed at creating machines that can perform tasks that typically require human intelligence,
```
You can develop your own PyTorch example based on these examples.

View file

@@ -0,0 +1,111 @@
# Overview of IPEX-LLM Containers for Intel GPU
An IPEX-LLM container is a pre-configured environment that includes all necessary dependencies for running LLMs on Intel GPUs.
This guide provides general instructions for setting up the IPEX-LLM Docker containers with Intel GPU. It begins with instructions and tips for Docker installation, and then introduces the available IPEX-LLM containers and their uses.
## Install Docker
### Linux
Follow the instructions in the [Official Docker Guide](https://www.docker.com/get-started/) to install Docker on Linux.
### Windows
```eval_rst
.. tip::
The installation requires at least 35GB of free disk space on the C drive.
```
```eval_rst
.. note::
Detailed installation instructions for Windows, including steps for enabling WSL2, can be found on the `Docker Desktop for Windows installation page <https://docs.docker.com/desktop/install/windows-install/>`_.
```
#### Install Docker Desktop for Windows
Follow the instructions in [this guide](https://docs.docker.com/desktop/install/windows-install/) to install **Docker Desktop for Windows**. Restart your machine after the installation is complete.
#### Install WSL2
Follow the instructions in [this guide](https://docs.microsoft.com/en-us/windows/wsl/install) to install **Windows Subsystem for Linux 2 (WSL2)**.
```eval_rst
.. tip::
You may verify WSL2 installation by running the command `wsl --list` in PowerShell or Command Prompt. If WSL2 is installed, you will see a list of installed Linux distributions.
```
#### Enable Docker integration with WSL2
Open **Docker Desktop**, then select `Settings` -> `Resources` -> `WSL integration`, turn on the `Ubuntu` toggle, and click `Apply & restart`.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/docker_desktop_new.png">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/docker_desktop_new.png" width=100%; />
</a>
```eval_rst
.. tip::
If you encounter **Docker Engine stopped** when opening Docker Desktop, you can reopen it in administrator mode.
```
#### Verify Docker is enabled in WSL2
Execute the following commands in PowerShell or Command Prompt to verify that Docker is enabled in WSL2:
```bash
wsl -d Ubuntu # Run Ubuntu WSL distribution
docker version # Check if Docker is enabled in WSL
```
You should see output similar to the following:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/docker_wsl.png">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/docker_wsl.png" width=100%; />
</a>
```eval_rst
.. tip::
While using Docker in WSL, Docker Desktop needs to be kept open.
```
## IPEX-LLM Docker Containers
We have several docker images available for running LLMs on Intel GPUs. The following table lists the available images and their uses:
| Image Name | Description | Use Case |
|------------|-------------|----------|
| intelanalytics/ipex-llm-cpu:latest | CPU Inference |For development and running LLMs using llama.cpp, Ollama and Python|
| intelanalytics/ipex-llm-xpu:latest | GPU Inference |For development and running LLMs using llama.cpp, Ollama and Python|
| intelanalytics/ipex-llm-serving-cpu:latest | CPU Serving|For serving multiple users/requests through REST APIs using vLLM/FastChat|
| intelanalytics/ipex-llm-serving-xpu:latest | GPU Serving|For serving multiple users/requests through REST APIs using vLLM/FastChat|
| intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:latest | CPU Finetuning via Docker|For fine-tuning LLMs using QLora/Lora, etc. |
|intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:latest|CPU Finetuning via Kubernetes|For fine-tuning LLMs using QLora/Lora, etc. |
| intelanalytics/ipex-llm-finetune-qlora-xpu:latest| GPU Finetuning|For fine-tuning LLMs using QLora/Lora, etc.|
We have also provided several quickstarts for various usage scenarios:
- [Run and develop LLM applications in PyTorch](./docker_pytorch_inference_gpu.html)
... to be added soon.
## Troubleshooting
If your machine has both an integrated GPU (iGPU) and a dedicated GPU (dGPU) such as Arc, you may encounter the following issue:
```bash
Abort was called at 62 line in file:
./shared/source/os_interface/os_interface.h
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) i7-14700K]
Registry and code: 13 MB
Command: python chat.py --model-path /llm/llm-models/chatglm2-6b/
Uptime: 29.349235 s
Aborted
```
To resolve this problem, you can disable the iGPU in Device Manager on Windows. For details, refer to [this guide](https://www.elevenforum.com/t/enable-or-disable-integrated-graphics-igpu-in-windows-11.18616/).

View file

@@ -0,0 +1,117 @@
# FastChat Serving with IPEX-LLM on Intel GPUs via docker
This guide demonstrates how to run `FastChat` serving with `IPEX-LLM` on Intel GPUs via Docker.
## Install docker
Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux.
## Pull the latest image
```bash
# This image will be updated every day
docker pull intelanalytics/ipex-llm-serving-xpu:latest
```
## Start Docker Container
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Change the `/path/to/models` to mount the models.
```
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=ipex-llm-serving-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
-v /path/to/models:/llm/models \
-e no_proxy=localhost,127.0.0.1 \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
$DOCKER_IMAGE
```
After the container is booted, you could get into the container through `docker exec`.
```bash
docker exec -it ipex-llm-serving-xpu-container /bin/bash
```
To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. In a machine with Arc A770, the sampled output is:
```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
## Running FastChat serving with IPEX-LLM on Intel GPU in Docker
For convenience, we have provided a script named `/llm/start-fastchat-service.sh` for you to start the service.
However, the script only covers the most common scenarios. If it doesn't meet your needs, you can always find the complete guidance for FastChat at [Serving using IPEX-LLM and FastChat](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#start-the-service).
Before starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#runtime-configurations) to set up our recommended runtime configurations.
Now you can start the FastChat service with the provided script `/llm/start-fastchat-service.sh` as follows:
```bash
# Only the MODEL_PATH needs to be set, other parameters have default values
export MODEL_PATH=YOUR_SELECTED_MODEL_PATH
export LOW_BIT_FORMAT=sym_int4
export CONTROLLER_HOST=localhost
export CONTROLLER_PORT=21001
export WORKER_HOST=localhost
export WORKER_PORT=21002
export API_HOST=localhost
export API_PORT=8000
# Use the default model_worker
bash /llm/start-fastchat-service.sh -w model_worker
```
If everything goes smoothly, the result should be similar to the following figure:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-fastchat.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/start-fastchat.png" width=100%; />
</a>
By default, we are using the `ipex_llm_worker` as the backend engine. You can also use `vLLM` as the backend engine. Try the following examples:
```bash
# Only the MODEL_PATH needs to be set, other parameters have default values
export MODEL_PATH=YOUR_SELECTED_MODEL_PATH
export LOW_BIT_FORMAT=sym_int4
export CONTROLLER_HOST=localhost
export CONTROLLER_PORT=21001
export WORKER_HOST=localhost
export WORKER_PORT=21002
export API_HOST=localhost
export API_PORT=8000
# Use the vllm_worker
bash /llm/start-fastchat-service.sh -w vllm_worker
```
The `vllm_worker` may start more slowly than the normal `ipex_llm_worker`. The booted service should look similar to the following figure:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-vllm-worker.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-vllm-worker.png" width=100%; />
</a>
```eval_rst
.. note::
To verify/use the service booted by the script, follow the instructions in `this guide <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#launch-restful-api-serve>`_.
```
After a request has been sent to the `openai_api_server`, the corresponding inference latency can be found in the worker log, as shown below:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-benchmark.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/fastchat-benchmark.png" width=100%; />
</a>
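You can also send such a request from Python through the OpenAI-compatible API started by `openai_api_server`. Below is a minimal sketch using `requests`, assuming the default `API_HOST`/`API_PORT` from the script above; `"YOUR_MODEL"` is a placeholder for the model name FastChat registered for your model:
```python
import requests

payload = {
    "model": "YOUR_MODEL",  # placeholder -- use the model name registered by FastChat
    "messages": [{"role": "user", "content": "What is AI?"}],
    "max_tokens": 64,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```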

View file

@@ -0,0 +1,15 @@
IPEX-LLM Docker Container User Guides
=====================================
In this section, you will find guides related to using IPEX-LLM with Docker, covering how to:
* `Overview of IPEX-LLM Containers <./docker_windows_gpu.html>`_
* Inference in Python/C++
* `GPU Inference in Python with IPEX-LLM <./docker_pytorch_inference_gpu.html>`_
* `VSCode LLM Development with IPEX-LLM on Intel GPU <./docker_pytorch_inference_gpu.html>`_
* `llama.cpp/Ollama/Open-WebUI with IPEX-LLM on Intel GPU <./docker_cpp_xpu_quickstart.html>`_
* Serving
* `FastChat with IPEX-LLM on Intel GPU <./fastchat_docker_quickstart.html>`_
* `vLLM with IPEX-LLM on Intel GPU <./vllm_docker_quickstart.html>`_
* `vLLM with IPEX-LLM on Intel CPU <./vllm_cpu_docker_quickstart.html>`_

View file

@@ -0,0 +1,118 @@
# vLLM Serving with IPEX-LLM on Intel CPU via Docker
This guide demonstrates how to run `vLLM` serving with `ipex-llm` on Intel CPU via Docker.
## Install docker
Follow the instructions in this [guide](https://www.docker.com/get-started/) to install Docker on Linux.
## Pull the latest image
*Note: For running vLLM serving on Intel CPUs, you can currently use either the `intelanalytics/ipex-llm-serving-cpu:latest` or `intelanalytics/ipex-llm-serving-vllm-cpu:latest` Docker image.*
```bash
# This image will be updated every day
docker pull intelanalytics/ipex-llm-serving-cpu:latest
```
## Start Docker Container
To make full use of your Intel CPU for running vLLM inference and serving, start the container as follows:
```
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:latest
export CONTAINER_NAME=ipex-llm-serving-cpu-container
sudo docker run -itd \
--net=host \
--cpuset-cpus="0-47" \
--cpuset-mems="0" \
-v /path/to/models:/llm/models \
-e no_proxy=localhost,127.0.0.1 \
--memory="64G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
$DOCKER_IMAGE
```
After the container is booted, you could get into the container through `docker exec`.
```bash
docker exec -it ipex-llm-serving-cpu-container /bin/bash
```
## Running vLLM serving with IPEX-LLM on Intel CPU in Docker
We have included multiple vLLM-related files in `/llm/`:
1. `vllm_offline_inference.py`: used for the vLLM offline inference example
2. `benchmark_vllm_throughput.py`: used for benchmarking throughput
3. `payload-1024.lua`: used for testing requests per second with requests of roughly 1024 input tokens and 128 output tokens
4. `start-vllm-service.sh`: a template for starting the vLLM service
Before performing benchmarks or starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_cpu.html#environment-setup) to set up our recommended runtime configurations.
### Service
A script named `/llm/start-vllm-service.sh` has been included in the image for starting the service conveniently.
Modify `model` and `served_model_name` in the script to fit your requirements. The `served_model_name` indicates the model name used in the API.
Then start the service using `bash /llm/start-vllm-service.sh`. If the service boots successfully, you should see output similar to the following figure:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" width=100%; />
</a>
#### Verify
After the service has been booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL` should be set equal to `served_model_name` in your booting script, e.g. `Qwen1.5`.
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "YOUR_MODEL",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}' | jq '.choices[0].text'
```
Below shows an example output using `Qwen1.5-7B-Chat` with low-bit format `sym_int4`:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" width=100%; />
</a>
#### Tuning
You can tune the service using the following arguments:
- `--max-model-len`
- `--max-num-batched-token`
- `--max-num-seq`
You can refer to this [doc](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#service) for a detailed explanation of these parameters.
### Benchmark
#### Online benchmark through api_server
We can benchmark the api_server to get an estimation about TPS (transactions per second). To do so, you need to start the service first according to the instructions mentioned above.
Then in the container, do the following:
1. Modify `/llm/payload-1024.lua` so that the `"model"` attribute is correct. By default, we use a prompt that is roughly 1024 tokens long; you can change it if needed.
2. Start the benchmark with `wrk` using the script below:
```bash
cd /llm
# warmup
wrk -t4 -c4 -d3m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
# You can change -t and -c to control the concurrency.
# By default, we use 8 connections to benchmark the service.
wrk -t8 -c8 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
```
#### Offline benchmark through benchmark_vllm_throughput.py
Please refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking.

View file

@@ -0,0 +1,146 @@
# vLLM Serving with IPEX-LLM on Intel GPUs via Docker
This guide demonstrates how to run `vLLM` serving with `IPEX-LLM` on Intel GPUs via Docker.
## Install docker
Follow the instructions in this [guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/DockerGuides/docker_windows_gpu.html#linux) to install Docker on Linux.
## Pull the latest image
*Note: For running vLLM serving on Intel GPUs, you can currently use either the `intelanalytics/ipex-llm-serving-xpu:latest` or `intelanalytics/ipex-llm-serving-vllm-xpu:latest` Docker image.*
```bash
# This image will be updated every day
docker pull intelanalytics/ipex-llm-serving-xpu:latest
```
## Start Docker Container
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. Change the `/path/to/models` to mount the models.
```
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=ipex-llm-serving-xpu-container
sudo docker run -itd \
--net=host \
--device=/dev/dri \
-v /path/to/models:/llm/models \
-e no_proxy=localhost,127.0.0.1 \
--memory="32G" \
--name=$CONTAINER_NAME \
--shm-size="16g" \
$DOCKER_IMAGE
```
After the container is booted, you could get into the container through `docker exec`.
```bash
docker exec -it ipex-llm-serving-xpu-container /bin/bash
```
To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. In a machine with Arc A770, the sampled output is:
```bash
root@arda-arc12:/# sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
```
## Running vLLM serving with IPEX-LLM on Intel GPU in Docker
We have included multiple vLLM-related files in `/llm/`:
1. `vllm_offline_inference.py`: used for the vLLM offline inference example
2. `benchmark_vllm_throughput.py`: used for benchmarking throughput
3. `payload-1024.lua`: used for testing requests per second with requests of roughly 1024 input tokens and 128 output tokens
4. `start-vllm-service.sh`: a template for starting the vLLM service
Before performing benchmarks or starting the service, you can refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#runtime-configurations) to set up our recommended runtime configurations.
### Service
#### Single card serving
A script named `/llm/start-vllm-service.sh` has been included in the image for starting the service conveniently.
Modify `model` and `served_model_name` in the script to fit your requirements. The `served_model_name` indicates the model name used in the API.
Then start the service using `bash /llm/start-vllm-service.sh`. If the service boots successfully, you should see output similar to the following figure:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" width=100%; />
</a>
#### Multi-card serving
vLLM supports utilizing multiple cards through tensor parallelism.
You can refer to this [documentation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#about-tensor-parallel) on how to utilize the `tensor-parallel` feature and start the service.
#### Verify
After the service has been booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL` should be set equal to `served_model_name` in your booting script, e.g. `Qwen1.5`.
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "YOUR_MODEL",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}' | jq '.choices[0].text'
```
Below shows an example output using `Qwen1.5-7B-Chat` with low-bit format `sym_int4`:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" width=100%; />
</a>
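Since the endpoint is OpenAI-compatible, you can also query it from Python. Below is a minimal sketch using the `openai` client (v1 or later, assuming it is installed), with the same placeholder model name as the `curl` example:
```python
from openai import OpenAI

# the local server does not check the API key, but the client requires a value
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="YOUR_MODEL",  # set this to the served_model_name in your booting script
    prompt="San Francisco is a",
    max_tokens=128,
    temperature=0,
)
print(completion.choices[0].text)
```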
#### Tuning
You can tune the service using these four arguments:
- `--gpu-memory-utilization`
- `--max-model-len`
- `--max-num-batched-token`
- `--max-num-seq`
You can refer to this [doc](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#service) for a detailed explanation of these parameters.
### Benchmark
#### Online benchmark through api_server
We can benchmark the api_server to get an estimation about TPS (transactions per second). To do so, you need to start the service first according to the instructions mentioned above.
Then in the container, do the following:
1. Modify `/llm/payload-1024.lua` so that the `"model"` attribute is correct. By default, we use a prompt that is roughly 1024 tokens long; you can change it if needed.
2. Start the benchmark with `wrk` using the script below:
```bash
cd /llm
# warmup due to JIT compilation
wrk -t4 -c4 -d3m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
# You can change -t and -c to control the concurrency.
# By default, we use 12 connections to benchmark the service.
wrk -t12 -c12 -d15m -s payload-1024.lua http://localhost:8000/v1/completions --timeout 1h
```
The following figure shows performing benchmark on `Llama-2-7b-chat-hf` using the above script:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/service-benchmark-result.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/service-benchmark-result.png" width=100%; />
</a>
#### Offline benchmark through benchmark_vllm_throughput.py
Please refer to this [section](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html#performing-benchmark) on how to use `benchmark_vllm_throughput.py` for benchmarking.

View file

@@ -0,0 +1,23 @@
# Self-Speculative Decoding
### Speculative Decoding in Practice
In [speculative](https://arxiv.org/abs/2302.01318) [decoding](https://arxiv.org/abs/2211.17192), a small (draft) model quickly generates multiple draft tokens, which are then verified in parallel by the large (target) model. While speculative decoding can effectively speed up the target model, ***in practice it is difficult to maintain or even obtain a proper draft model***, especially when the target model is finetuned with customized data.
### Self-Speculative Decoding
Built on top of the concept of “[self-speculative decoding](https://arxiv.org/abs/2309.08168)”, IPEX-LLM can now accelerate the original FP16 or BF16 model ***without the need for a separate draft model or model finetuning***; instead, it automatically converts the original model to INT4 and uses the INT4 model as the draft model behind the scenes. In practice, this brings ***~30% speedup*** for FP16 and BF16 LLM inference latency on Intel GPUs and CPUs respectively.
### Using IPEX-LLM Self-Speculative Decoding
Please refer to IPEX-LLM self-speculative decoding code snippets below, and the detailed [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Speculative-Decoding) and [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Speculative-Decoding) examples in the project repo.
```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

# model_path, input_ids and args below are placeholders from the surrounding example
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             torch_dtype=torch.float16,  # use torch.bfloat16 on CPU
                                             load_in_low_bit="fp16",     # use "bf16" on CPU
                                             speculative=True,           # set speculative to True
                                             trust_remote_code=True,
                                             use_cache=True)
output = model.generate(input_ids,
                        max_new_tokens=args.n_predict,
                        do_sample=False)
```

View file

@@ -0,0 +1,79 @@
# Frequently Asked Questions (FAQ)
## General Info & Concepts
### GGUF format usage with IPEX-LLM?
IPEX-LLM supports running GGUF/AWQ/GPTQ models on both [CPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations) and [GPU](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations).
Please also refer to [here](https://github.com/intel-analytics/ipex-llm?tab=readme-ov-file#latest-update-) for our latest support.
## How to Resolve Errors
### Fail to install `ipex-llm` through `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/` or `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/`
You could try to install IPEX-LLM dependencies for Intel XPU from source archives:
- For Windows system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#install-ipex-llm-from-wheel) for the steps.
- For Linux system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id3) for the steps.
### PyTorch is not linked with support for xpu devices
1. Before running on Intel GPUs, please make sure you've prepared your environment following the [installation instructions](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html).
2. If you are using an older version of `ipex-llm` (specifically, older than 2.5.0b20240104), you need to manually add `import intel_extension_for_pytorch as ipex` at the beginning of your code.
3. After optimizing the model with IPEX-LLM, you need to move the model to the GPU through `model = model.to('xpu')`.
4. If you have multiple GPUs, you could refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.html) for details about GPU selection.
5. If you do inference using the optimized model on Intel GPUs, you also need to move the input tensors with `to('xpu')`, as shown in the sketch after this list.
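Putting the points above together, here is a minimal sketch of the expected flow; the model name and prompt are only examples:
```python
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM
# for ipex-llm older than 2.5.0b20240104, also add:
# import intel_extension_for_pytorch as ipex

model_path = "meta-llama/Llama-2-7b-chat-hf"  # example model

model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
model = model.to('xpu')  # move the optimized model to the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to('xpu')  # inputs must be on xpu too
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```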
### Import `intel_extension_for_pytorch` error on Windows GPU
Please refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#error-loading-intel-extension-for-pytorch) for a detailed guide. We list the possible missing requirements in your environment that could lead to this error.
### XPU device count is zero
It's recommended to reinstall driver:
- For Windows system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#prerequisites) for the steps.
- For Linux system, refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id1) for the steps.
### Error such as `The size of tensor a (33) must match the size of tensor b (17) at non-singleton dimension 2` during the attention forward function
If you are using the IPEX-LLM PyTorch API, please try setting `optimize_llm=False` manually when calling the `optimize_model` function to work around it. As for the IPEX-LLM `transformers`-style API, try setting `optimize_model=False` manually when calling the `from_pretrained` function, as sketched below.
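A minimal sketch of both workarounds (the model path is a placeholder):
```python
# PyTorch API: `model` is a PyTorch model you have already loaded
from ipex_llm import optimize_model
model = optimize_model(model, optimize_llm=False)

# transformers-style API: disable model optimization at load time
from ipex_llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("/path/to/model", load_in_4bit=True,
                                             optimize_model=False)
```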
### ValueError: Unrecognized configuration class
This error is not quite relevant to IPEX-LLM. It could be that you are using the incorrect AutoClass, that the transformers version is not up to date, or that transformers does not support loading this model with AutoClasses. You need to refer to the model card on Hugging Face to confirm this information. Besides, if you load the model from a local path, please also make sure you have downloaded the complete model files.
### `mixed dtype (CPU): expect input to have scalar type of BFloat16` during inference
You could solve this error by converting the optimized model to `bf16` through `model.to(torch.bfloat16)` before inference.
### Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
This error is caused by running out of GPU memory. Some possible ways to decrease GPU memory usage:
1. If you run several models sequentially, please make sure you release the GPU memory of the previous model with `del model` in time.
2. You could use `model = model.half()` or `model = model.bfloat16()` before moving the model to the GPU to use less GPU memory.
3. You could try setting `cpu_embedding=True` when calling `from_pretrained` of the AutoClass or the `optimize_model` function (see the sketch after this list).
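A sketch combining these options (the model path is a placeholder):
```python
import gc
from ipex_llm.transformers import AutoModelForCausalLM

# option 3: keep the memory-intensive embedding layer on the CPU at load time
model = AutoModelForCausalLM.from_pretrained("/path/to/model", load_in_4bit=True,
                                             cpu_embedding=True)

# option 2: use half precision before moving the model to the GPU
model = model.half()  # or model.bfloat16()
model = model.to('xpu')

# ... run inference ...

# option 1: release the GPU memory before loading the next model
del model
gc.collect()
```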
### Failed to enable AMX
You could use `export BIGDL_LLM_AMX_DISABLED=1` to disable AMX manually and solve this error.
### oneCCL: comm_selector.cpp:57 create_comm_impl: EXCEPTION: ze_data was not initialized
You may encounter this error during finetuning on multi GPUs. Please try `sudo apt install level-zero-dev` to fix it.
### Random and unreadable output of Gemma-7b-it on Arc 770 Ubuntu 22.04 due to driver and oneAPI mismatch
A mismatch between the driver and oneAPI versions can lead to errors when IPEX-LLM uses XMX (for short prompts) to speed up inference.
The output of `What's AI?` may look like the following:
```
wiedzy Artificial Intelligence meliti: Artificial Intelligence undenti beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng beng
```
If you meet this error, please check your driver and oneAPI versions with `sudo apt list --installed | egrep "intel-basekit|intel-level-zero-gpu"`.
Make sure `intel-basekit>=2024.0.1-43` and `intel-level-zero-gpu>=1.3.27191.42-775~22.04`.
### Too many open files
You may encounter this error during finetuning, especially when running a 70B model. Please raise the system open file limit using `ulimit -n 1048576`.
### `RuntimeError: could not create a primitive` on Windows
This error may happen when multiple GPUs exist on Windows. To solve this error, you can open Device Manager (search "Device Manager" in your start menu), click the "Display adapters" option, and disable all the GPU devices you do not want to use. Restart your computer and try again; IPEX-LLM should work fine this time.

View file

@@ -0,0 +1,40 @@
# CLI (Command Line Interface) Tool
```eval_rst
.. note::
Currently ``ipex-llm`` CLI supports *LLaMA* (e.g., vicuna), *GPT-NeoX* (e.g., redpajama), *BLOOM* (e.g., phoenix) and *GPT2* (e.g., starcoder) model architectures; for other models, you may use the ``transformers``-style or LangChain APIs.
```
## Convert Model
You may convert the downloaded model into native INT4 format using `llm-convert`.
```bash
# convert PyTorch (fp16 or fp32) model;
# llama/bloom/gptneox/starcoder model family is currently supported
llm-convert "/path/to/model/" --model-format pth --model-family "bloom" --outfile "/path/to/output/"
# convert GPTQ-4bit model
# only llama model family is currently supported
llm-convert "/path/to/model/" --model-format gptq --model-family "llama" --outfile "/path/to/output/"
```
## Run Model
You may run the converted model using `llm-cli` or `llm-chat` (built on top of `main.cpp` in [`llama.cpp`](https://github.com/ggerganov/llama.cpp)).
```bash
# help
# llama/bloom/gptneox/starcoder model family is currently supported
llm-cli -x gptneox -h
# text completion
# llama/bloom/gptneox/starcoder model family is currently supported
llm-cli -t 16 -x gptneox -m "/path/to/output/model.bin" -p 'Once upon a time,'
# chat mode
# llama/gptneox model family is currently supported
llm-chat -m "/path/to/output/model.bin" -x llama
```

View file

@@ -0,0 +1,64 @@
# Finetune (QLoRA)
We also support finetuning LLMs (large language models) using QLoRA with IPEX-LLM 4bit optimizations on Intel GPUs.
```eval_rst
.. note::
Currently, QLoRA finetuning is only supported for Hugging Face Transformers models.
```
To help you better understand the finetuning process, here we use model [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) as an example.
**Make sure you have prepared environment following instructions [here](../install_gpu.html).**
```eval_rst
.. note::
If you are using an older version of ``ipex-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
```
First, load the model using the `transformers`-style API and **move it to the GPU with `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.
```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4",
                                             optimize_model=False,
                                             torch_dtype=torch.float16,
                                             modules_to_not_convert=["lm_head"])
model = model.to('xpu')
```
Then, we have to apply some preprocessing to the model to prepare it for training.
```python
from ipex_llm.transformers.qlora import prepare_model_for_kbit_training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
```
Next, we can obtain a Peft model from the optimized model and a configuration object containing the parameters as follows:
```python
from ipex_llm.transformers.qlora import get_peft_model
from peft import LoraConfig
config = LoraConfig(r=8,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM")
model = get_peft_model(model, config)
```
```eval_rst
.. important::
Instead of ``from peft import prepare_model_for_kbit_training, get_peft_model`` as we did for regular QLoRA using bitsandbytes and CUDA, we import them from ``ipex_llm.transformers.qlora`` here to get an IPEX-LLM compatible Peft model. The rest is just the same as the regular LoRA finetuning process using ``peft``.
```
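From here the workflow is the standard `peft`/`transformers` training loop. Below is a minimal sketch, assuming a tokenized dataset `train_data` and a `tokenizer` (both hypothetical names) have already been prepared:
```python
import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,  # hypothetical tokenized dataset
    args=transformers.TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=20,
        max_steps=200,
        learning_rate=2e-4,
        logging_steps=20,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # avoid warnings during training; re-enable for inference
trainer.train()
```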
```eval_rst
.. seealso::
See the complete examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU>`_
```

View file

@@ -0,0 +1,14 @@
GPU Supports
================================
IPEX-LLM not only supports running large language models for inference, but also supports QLoRA finetuning on Intel GPUs.
* |inference_on_gpu|_
* `Finetune (QLoRA) <./finetune.html>`_
* `Multi GPUs selection <./multi_gpus_selection.html>`_
.. |inference_on_gpu| replace:: Inference on GPU
.. _inference_on_gpu: ./inference_on_gpu.html
.. |multi_gpus_selection| replace:: Multi GPUs selection
.. _multi_gpus_selection: ./multi_gpus_selection.html

View file

@@ -0,0 +1,54 @@
# Hugging Face ``transformers`` Format
## Load in Low Precision
You may apply INT4 optimizations to any Hugging Face *Transformers* models as follows:
```python
# load Hugging Face Transformers model with INT4 optimizations
from ipex_llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
```
After loading the Hugging Face *Transformers* model, you may easily run the optimized model as follows:
```python
# run the optimized model
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```
```eval_rst
.. seealso::
See the complete CPU examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels>`_ and GPU examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels>`_.
.. note::
You may apply more low bit optimizations (including INT8, INT5 and INT4) as follows:
.. code-block:: python
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int5")
See the CPU example `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types>`_ and GPU example `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types>`_.
```
## Save & Load
After the model is optimized using INT4 (or INT8/INT5), you may save and load the optimized model as follows:
```python
model.save_low_bit(model_path)
new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
```eval_rst
.. seealso::
See the CPU example `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load>`_ and GPU example `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Save-Load>`_
```

View file

@@ -0,0 +1,33 @@
IPEX-LLM Key Features
================================
You may run the LLMs using ``ipex-llm`` through one of the following APIs:
* `PyTorch API <./optimize_model.html>`_
* |transformers_style_api|_
* |hugging_face_transformers_format|_
* `Native Format <./native_format.html>`_
* `LangChain API <./langchain_api.html>`_
* |gpu_supports|_
* |inference_on_gpu|_
* `Finetune (QLoRA) <./finetune.html>`_
* `Multi GPUs selection <./multi_gpus_selection.html>`_
.. |transformers_style_api| replace:: ``transformers``-style API
.. _transformers_style_api: ./transformers_style_api.html
.. |hugging_face_transformers_format| replace:: Hugging Face ``transformers`` Format
.. _hugging_face_transformers_format: ./hugging_face_format.html
.. |gpu_supports| replace:: GPU Supports
.. _gpu_supports: ./gpu_supports.html
.. |inference_on_gpu| replace:: Inference on GPU
.. _inference_on_gpu: ./inference_on_gpu.html
.. |multi_gpus_selection| replace:: Multi GPUs selection
.. _multi_gpus_selection: ./multi_gpus_selection.html

View file

@ -0,0 +1,128 @@
# Inference on GPU
Apart from the significant acceleration capabilities on Intel CPUs, IPEX-LLM also supports optimizations and acceleration for running LLMs (large language models) on Intel GPUs. With IPEX-LLM, PyTorch models (in FP16/BF16/FP32) can be optimized with low-bit quantizations (supported precisions include INT4, INT5, INT8, etc.).
Compared with running on Intel CPUs, some additional operations are required on Intel GPUs. To help you better understand the process, here we use a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.
**Make sure you have prepared environment following instructions [here](../install_gpu.html).**
```eval_rst
.. note::
If you are using an older version of ``ipex-llm`` (specifically, older than 2.5.0b20240104), you need to manually add ``import intel_extension_for_pytorch as ipex`` at the beginning of your code.
```
## Load and Optimize Model
You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-style API](./transformers_style_api.html) on Intel GPUs according to your preference.
**Once you have obtained the model with IPEX-LLM low-bit optimization, move it to the GPU with `to('xpu')`.**
```eval_rst
.. tabs::
.. tab:: PyTorch API
You could optimize any PyTorch model with a one-line code change, and the loading and optimizing process on Intel GPUs may be as follows:
.. code-block:: python
# Take Llama-2-7b-chat-hf as an example
from transformers import LlamaForCausalLM
from ipex_llm import optimize_model
model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True)
model = optimize_model(model) # With only one line to enable IPEX-LLM INT4 optimization
model = model.to('xpu') # Important after obtaining the optimized model
.. tip::
For Windows users running LLMs on Intel iGPUs, we recommend setting ``cpu_embedding=True`` in the ``optimize_model`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
See the `API doc <../../../PythonAPI/LLM/optimize.html#ipex_llm.optimize_model>`_ for ``optimize_model`` to find more information.
In particular, if you have saved the optimized model following the steps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs may be as follows:
.. code-block:: python
from transformers import LlamaForCausalLM
from ipex_llm.optimize import low_memory_init, load_low_bit
saved_dir='./llama-2-ipex-llm-4-bit'
with low_memory_init(): # Fast and low cost by loading model on meta device
model = LlamaForCausalLM.from_pretrained(saved_dir,
torch_dtype="auto",
trust_remote_code=True)
model = load_low_bit(model, saved_dir) # Load the optimized model
model = model.to('xpu') # Important after obtaining the optimized model
.. tab:: ``transformers``-style API
You could run any Hugging Face Transformers model with the ``transformers``-style API, and the loading and optimizing process on Intel GPUs may be as follows:
.. code-block:: python
# Take Llama-2-7b-chat-hf as an example
from ipex_llm.transformers import AutoModelForCausalLM
# Load model in 4 bit, which converts the relevant layers in the model into INT4 format
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', load_in_4bit=True)
model = model.to('xpu') # Important after obtaining the optimized model
.. tip::
For Windows users running LLMs on Intel iGPUs, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of the iGPU.
See the `API doc <../../../PythonAPI/LLM/transformers.html#hugging-face-transformers-automodel>`_ to find more information.
In particular, if you have saved the optimized model following the steps `here <./hugging_face_format.html#save-load>`_, the loading process on Intel GPUs may be as follows:
.. code-block:: python
from ipex_llm.transformers import AutoModelForCausalLM
saved_dir='./llama-2-ipex-llm-4-bit'
model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model
model = model.to('xpu') # Important after obtaining the optimized model
.. tip::
For Windows users running saved optimized models on Intel iGPUs, we also recommend setting ``cpu_embedding=True`` in the ``load_low_bit`` function.
```
## Run Optimized Model
You could then do inference using the optimized model on Intel GPUs in almost the same way as on CPUs. **The only difference is to set `to('xpu')` for input tensors.**
Continuing with the [example of Llama-2-7b-chat-hf](#load-and-optimize-model), you may run inference as follows:
```python
import torch
from transformers import LlamaTokenizer

# load the tokenizer corresponding to Llama-2-7b-chat-hf
tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')

with torch.inference_mode():
    prompt = 'Q: What is CPU?\nA:'
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') # With .to('xpu') specifically for inference on Intel GPUs
    output = model.generate(input_ids, max_new_tokens=32)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
```
```eval_rst
.. note::
The initial generation of optimized LLMs on Intel GPUs could be slow. Therefore, it's recommended to perform a **warm-up** run before the actual generation.
```
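For reference, a warm-up could simply be a throw-away `generate` call before the measured one. The sketch below is an illustration only and assumes the `model`, `tokenizer` and `input_ids` from the example above:
```python
import time
import torch

with torch.inference_mode():
    # warm-up run: the first generation triggers GPU kernel compilation and caching
    _ = model.generate(input_ids, max_new_tokens=32)

    # actual generation, now reflecting steady-state performance
    start = time.time()
    output = model.generate(input_ids, max_new_tokens=32)
    print(f"Generation took {time.time() - start:.2f} seconds")
```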
```eval_rst
.. note::
If you are a Windows user, please also note that for **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
```
```eval_rst
.. seealso::
See the complete examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU>`_
```

View file

@ -0,0 +1,57 @@
# LangChain API
You may run the models using the LangChain API in `ipex-llm`.
## Using Hugging Face `transformers` INT4 Format
You may run any Hugging Face *Transformers* model (with INT4 optimizations applied) using the LangChain API as follows:
```python
from ipex_llm.langchain.llms import TransformersLLM
from ipex_llm.langchain.embeddings import TransformersEmbeddings
from langchain.chains.question_answering import load_qa_chain
embeddings = TransformersEmbeddings.from_model_id(model_id=model_path)
ipex_llm = TransformersLLM.from_model_id(model_id=model_path, ...)
doc_chain = load_qa_chain(ipex_llm, ...)
output = doc_chain.run(...)
```
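To make the skeleton above more concrete, below is a minimal question-answering sketch. The model path, the sample document, the question text and the `chain_type="stuff"` argument are illustrative assumptions rather than part of the official example:
```python
from ipex_llm.langchain.llms import TransformersLLM
from langchain.chains.question_answering import load_qa_chain
from langchain.schema import Document

model_path = '/path/to/model/'  # hypothetical local model path

# load the INT4-optimized LLM through the ipex-llm LangChain wrapper
ipex_llm = TransformersLLM.from_model_id(model_id=model_path)

# "stuff" simply concatenates the documents into the prompt (assumed here for illustration)
doc_chain = load_qa_chain(ipex_llm, chain_type="stuff")

docs = [Document(page_content="IPEX-LLM accelerates LLM inference on Intel CPUs and GPUs.")]
output = doc_chain.run(input_documents=docs, question="What does IPEX-LLM do?")
print(output)
```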
```eval_rst
.. seealso::
See the examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/LangChain/transformers_int4>`_.
```
## Using Native INT4 Format
You may also convert Hugging Face *Transformers* models into native INT4 format, and then run the converted models using the LangChain API as follows.
```eval_rst
.. note::
* Currently only llama/bloom/gptneox/starcoder model families are supported; for other models, you may use the Hugging Face ``transformers`` INT4 format as described `above <./langchain_api.html#using-hugging-face-transformers-int4-format>`_.
* You may choose the corresponding API developed for specific native models to load the converted model.
```
```python
from ipex_llm.langchain.llms import LlamaLLM
from ipex_llm.langchain.embeddings import LlamaEmbeddings
from langchain.chains.question_answering import load_qa_chain
# switch to GptneoxEmbeddings/BloomEmbeddings/StarcoderEmbeddings to load other models
embeddings = LlamaEmbeddings(model_path='/path/to/converted/model.bin')
# switch to GptneoxLLM/BloomLLM/StarcoderLLM to load other models
ipex_llm = LlamaLLM(model_path='/path/to/converted/model.bin')
doc_chain = load_qa_chain(ipex_llm, ...)
doc_chain.run(...)
```
```eval_rst
.. seealso::
See the examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/LangChain/native_int4>`_.
```

View file

@ -0,0 +1,86 @@
# Multi Intel GPUs selection
In [Inference on GPU](inference_on_gpu.md) and [Finetune (QLoRA)](finetune.md), you have learned how to run inference and finetuning on Intel GPUs. In this section, we will show you two approaches to select GPU devices.
## List devices
The `sycl-ls` tool enumerates the devices available in the system. You can use it after you set up the oneAPI environment:
```eval_rst
.. tabs::
.. tab:: Windows
Please make sure you are using CMD (Miniforge Prompt if using conda):
.. code-block:: cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
sycl-ls
.. tab:: Linux
.. code-block:: bash
source /opt/intel/oneapi/setvars.sh
sycl-ls
```
If you have two Arc A770 GPUs, you may get output like the following:
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) i9-14900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[opencl:gpu:4] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.26241]
```
This output shows there are two Arc A770 GPUs as well as an Intel iGPU on this machine.
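If you prefer to enumerate the devices from Python rather than `sycl-ls`, a small sketch like the following should work once `intel_extension_for_pytorch` is installed (the device names printed depend on your machine):
```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend

# list the XPU devices visible to PyTorch
for i in range(torch.xpu.device_count()):
    print(f"xpu:{i} -> {torch.xpu.get_device_name(i)}")
```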
## Devices selection
To run on an XPU device, move both your model and input tensors to the XPU as shown below:
```python
model = model.to('xpu')
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
```
There are two ways to select the desired device: one is changing the code, the other is setting an environment variable.
### 1. Select device in Python
To specify an XPU, change `to('xpu')` to `to('xpu:[device_id]')`, where `device_id` is counted from zero.
If you want to use the second device, you can change the code like this:
```python
model = model.to('xpu:1')
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu:1')
```
### 2. OneAPI device selector
The device selection environment variable `ONEAPI_DEVICE_SELECTOR` can be used to limit the choice of Intel GPU devices. As the `sycl-ls` output above shows, the last three lines are three Level Zero GPU devices, so we can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select devices.
For example, if you want to use the second A770 GPU, you can run your Python script like this:
```eval_rst
.. tabs::
.. tab:: Windows
.. code-block:: cmd
set ONEAPI_DEVICE_SELECTOR=level_zero:1
python generate.py
Through ``set ONEAPI_DEVICE_SELECTOR=level_zero:1``, only the second A770 GPU will be available in the current environment.
.. tab:: Linux
.. code-block:: bash
ONEAPI_DEVICE_SELECTOR=level_zero:1 python generate.py
``ONEAPI_DEVICE_SELECTOR=level_zero:1`` in the above command only affects the current Python program. Alternatively, you can export the environment variable and then run your Python script:
.. code-block:: bash
export ONEAPI_DEVICE_SELECTOR=level_zero:1
python generate.py
```

View file

@ -0,0 +1,32 @@
# Native Format
You may also convert Hugging Face *Transformers* models into native INT4 format for maximum performance as follows.
```eval_rst
.. note::
Currently only llama/bloom/gptneox/starcoder/chatglm model families are supported; you may use the corresponding API to load the converted model. (For other models, you can use the Hugging Face ``transformers`` format as described `here <./hugging_face_format.html>`_).
```
```python
# convert the model
from ipex_llm import llm_convert
ipex_llm_path = llm_convert(model='/path/to/model/',
outfile='/path/to/output/', outtype='int4', model_family="llama")
# load the converted model
# switch to ChatGLMForCausalLM/GptneoxForCausalLM/BloomForCausalLM/StarcoderForCausalLM to load other models
from ipex_llm.transformers import LlamaForCausalLM
llm = LlamaForCausalLM.from_pretrained("/path/to/output/model.bin", native=True, ...)
# run the converted model
input_ids = llm.tokenize(prompt)
output_ids = llm.generate(input_ids, ...)
output = llm.batch_decode(output_ids)
```
```eval_rst
.. seealso::
See the complete example `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models>`_
```

View file

@ -0,0 +1,69 @@
## PyTorch API
In general, you just need a one-line `optimize_model` call to easily optimize any loaded PyTorch model, regardless of the library or API you are using. With IPEX-LLM, PyTorch models (in FP16/BF16/FP32) can be optimized with low-bit quantizations (supported precisions include INT4, INT5, INT8, etc.).
### Optimize model
First, use any PyTorch APIs you like to load your model. To help you better understand the process, here we use [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library `LlamaForCausalLM` to load a popular model [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example:
```python
# Create or load any Pytorch model, take Llama-2-7b-chat-hf as an example
from transformers import LlamaForCausalLM
model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True)
```
Then, you just need to call `optimize_model` to optimize the loaded model; INT4 optimization is applied to the model by default:
```python
from ipex_llm import optimize_model
# With only one line to enable IPEX-LLM INT4 optimization
model = optimize_model(model)
```
After optimizing the model, IPEX-LLM does not require any change in the inference code. You can use any libraries to run the optimized model with very low latency.
### More Precisions
In [Optimize Model](#optimize-model) above, symmetric INT4 optimization is applied by default. You may apply other low bit optimizations (INT5, INT8, etc.) by specifying the ``low_bit`` parameter.
Currently, ``low_bit`` supports options 'sym_int4', 'asym_int4', 'sym_int5', 'asym_int5' or 'sym_int8', in which 'sym' and 'asym' differentiate between symmetric and asymmetric quantization. Symmetric quantization allocates bits for positive and negative values equally, whereas asymmetric quantization allows different bit allocations for positive and negative values.
You may apply symmetric INT8 optimization as follows:
```python
from ipex_llm import optimize_model
# Apply symmetric INT8 optimization
model = optimize_model(model, low_bit="sym_int8")
```
### Save & Load Optimized Model
The loading process of the original model may be time-consuming and memory-intensive. For example, the [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model is stored with float16 precision, resulting in large memory usage when loaded using `LlamaForCausalLM`. To avoid high resource consumption and to expedite the loading process, you can use `save_low_bit` to store the model after low-bit optimization. Then, in subsequent uses, you can opt to use the `load_low_bit` API to directly load the optimized model. Besides, the saving and loading operations are platform-independent, regardless of operating system.
#### Save
Continuing with the [example of Llama-2-7b-chat-hf](#optimize-model), we can save the previously optimized model as follows:
```python
saved_dir='./llama-2-ipex-llm-4-bit'
model.save_low_bit(saved_dir)
```
#### Load
We recommend using the context manager `low_memory_init` to quickly initiate a model instance with low cost, and then using `load_low_bit` to load the optimized low-bit model as follows:
```python
from ipex_llm.optimize import low_memory_init, load_low_bit
with low_memory_init(): # Fast and low cost by loading model on meta device
model = LlamaForCausalLM.from_pretrained(saved_dir,
torch_dtype="auto",
trust_remote_code=True)
model = load_low_bit(model, saved_dir) # Load the optimized model
```
```eval_rst
.. seealso::
* Please refer to the `API documentation <https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/optimize.html>`_ for more details.
* We also provide detailed examples on how to run PyTorch models (e.g., Openai Whisper, LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using IPEX-LLM. See the complete CPU examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models>`_ and GPU examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models>`_.
```

View file

@ -0,0 +1,10 @@
``transformers``-style API
================================
You may run the LLMs using ``transformers``-style API in ``ipex-llm``.
* |hugging_face_transformers_format|_
* `Native Format <./native_format.html>`_
.. |hugging_face_transformers_format| replace:: Hugging Face ``transformers`` Format
.. _hugging_face_transformers_format: ./hugging_face_format.html

View file

@ -0,0 +1,9 @@
IPEX-LLM Examples
================================
You can use IPEX-LLM to run any PyTorch model with INT4 optimizations on Intel XPU (from Laptop to GPU to Cloud).
Here, we provide examples to help you quickly get started using IPEX-LLM to run some popular open-source models in the community. Please refer to the appropriate guide based on your device:
* `CPU <./examples_cpu.html>`_
* `GPU <./examples_gpu.html>`_

View file

@ -0,0 +1,64 @@
# IPEX-LLM Examples: CPU
Here, we provide some examples on how you could apply IPEX-LLM INT4 optimizations on popular open-source models in the community.
To run these examples, please first refer to [here](./install_cpu.html) for more information about how to install ``ipex-llm``, requirements and best practices for setting up your environment.
The following models have been verified on either servers or laptops with Intel CPUs.
## Example of PyTorch API
| Model | Example of PyTorch API |
|------------|-------------------------------------------------------|
| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/llama2) |
| ChatGLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/chatglm) |
| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/mistral) |
| Bark | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/bark) |
| BERT | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/bert) |
| Openai Whisper | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/Model/openai-whisper) |
```eval_rst
.. important::
In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through PyTorch API as `example <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/PyTorch-Models/More-Data-Types>`_.
```
## Example of `transformers`-style API
| Model | Example of `transformers`-style API |
|------------|-------------------------------------------------------|
| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/vicuna) |
| LLaMA 2 | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2) |
| ChatGLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm) |
| ChatGLM2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) |
| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/mistral) |
| Falcon | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/falcon) |
| MPT | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/mpt) |
| Dolly-v1 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) |
| Dolly-v2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) |
| Replit Code| [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/replit) |
| RedPajama | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/redpajama) |
| Phoenix | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/phoenix) |
| StarCoder | [link1](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Native-Models), [link2](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/starcoder) |
| Baichuan | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan) |
| Baichuan2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan2) |
| InternLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm) |
| Qwen | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen) |
| Aquila | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila) |
| MOSS | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/moss) |
| Whisper | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/whisper) |
```eval_rst
.. important::
In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through ``transformers``-style API as `example <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types>`_.
```
```eval_rst
.. seealso::
See the complete examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU>`_.
```

View file

@ -0,0 +1,70 @@
# IPEX-LLM Examples: GPU
Here, we provide some examples on how you could apply IPEX-LLM INT4 optimizations on popular open-source models in the community.
To run these examples, please first refer to [here](./install_gpu.html) for more information about how to install ``ipex-llm``, requirements and best practices for setting up your environment.
```eval_rst
.. important::
Only Linux systems are supported now; Ubuntu 22.04 is preferred.
```
The following models have been verified on either servers or laptops with Intel GPUs.
## Example of PyTorch API
| Model | Example of PyTorch API |
|------------|-------------------------------------------------------|
| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/llama2) |
| ChatGLM 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/chatglm2) |
| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/mistral) |
| Baichuan | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/baichuan) |
| Baichuan2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/baichuan2) |
| Replit | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/replit) |
| StarCoder | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/starcoder) |
| Dolly-v1 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/dolly-v1) |
| Dolly-v2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/Model/dolly-v2) |
```eval_rst
.. important::
In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through PyTorch API as `example <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/PyTorch-Models/More-Data-Types>`_.
```
## Example of `transformers`-style API
| Model | Example of `transformers`-style API |
|------------|-------------------------------------------------------|
| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* |[link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/vicuna)|
| LLaMA 2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2) |
| ChatGLM2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chatglm2) |
| Mistral | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/mistral) |
| Falcon | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/falcon) |
| MPT | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/mpt) |
| Dolly-v1 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) |
| Dolly-v2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) |
| Replit | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/replit) |
| StarCoder | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/starcoder) |
| Baichuan | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/baichuan) |
| Baichuan2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/baichuan2) |
| InternLM | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/internlm) |
| Qwen | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/qwen) |
| Aquila | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/aquila) |
| Whisper | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/whisper) |
| Chinese Llama2 | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/chinese-llama2) |
| GPT-J | [link](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/gpt-j) |
```eval_rst
.. important::
In addition to INT4 optimization, IPEX-LLM also provides other low bit optimizations (such as INT8, INT5, NF4, etc.). You may apply other low bit optimizations through ``transformers``-style API as `example <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types>`_.
```
```eval_rst
.. seealso::
See the complete examples `here <https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU>`_.
```

View file

@ -0,0 +1,7 @@
IPEX-LLM Installation
================================
Here, we provide instructions on how to install ``ipex-llm`` and best practices for setting up your environment. Please refer to the appropriate guide based on your device:
* `CPU <./install_cpu.html>`_
* `GPU <./install_gpu.html>`_

View file

@ -0,0 +1,100 @@
# IPEX-LLM Installation: CPU
## Quick Installation
Install IPEX-LLM for CPU support using pip through:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
.. tab:: Windows
.. code-block:: cmd
pip install --pre --upgrade ipex-llm[all]
```
Please refer to [Environment Setup](#environment-setup) for more information.
```eval_rst
.. note::
The ``all`` option will trigger installation of all the dependencies for common LLM application development.
.. important::
``ipex-llm`` is tested with Python 3.9, 3.10 and 3.11; Python 3.11 is recommended for best practices.
```
## Recommended Requirements
Here we list the recommended hardware and OS for a smooth IPEX-LLM optimization experience on CPU:
* Hardware
* PCs equipped with 12th Gen Intel® Core™ processor or higher, and at least 16GB RAM
* Servers equipped with Intel® Xeon® processors and at least 32GB RAM
* Operating System
* Ubuntu 20.04 or later
* CentOS 7 or later
* Windows 10/11, with or without WSL
## Environment Setup
For optimal performance with LLM models using IPEX-LLM optimizations on Intel CPUs, here are some best practices for setting up your environment:
First, we recommend using [Conda](https://conda-forge.org/download/) to create a Python 3.11 environment:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
.. tab:: Windows
.. code-block:: cmd
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all]
```
Then, to run an LLM with IPEX-LLM optimizations (taking an `example.py` as an example; a minimal sketch of such a script is shown after the tabs below):
```eval_rst
.. tabs::
.. tab:: Client
It is recommended to run directly with full utilization of all CPU cores:
.. code-block:: bash
python example.py
.. tab:: Server
It is recommended to run with all the physical cores of a single socket:
.. code-block:: bash
# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python example.py
```
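For reference, a minimal `example.py` might look like the following sketch; the model id and prompt here are placeholders, so substitute any model supported by `ipex-llm`:
```python
# example.py: a minimal sketch of running an INT4-optimized model with ipex-llm on CPU
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

model_path = "openlm-research/open_llama_3b_v2"  # placeholder model id

# load the model with INT4 optimization, plus its tokenizer
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = LlamaTokenizer.from_pretrained(model_path)

prompt = "Q: What is CPU?\nA:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```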

View file

@ -0,0 +1,666 @@
# IPEX-LLM Installation: GPU
## Windows
### Prerequisites
IPEX-LLM on Windows supports Intel iGPU and dGPU.
```eval_rst
.. important::
IPEX-LLM on Windows only supports PyTorch 2.1.
```
To apply Intel GPU acceleration, please first verify your GPU driver version.
```eval_rst
.. note::
The GPU driver version of your device can be checked in the "Task Manager" -> GPU 0 (or GPU 1, etc.) -> Driver version.
```
If you have driver version lower than `31.0.101.5122`, it is recommended to [**update your GPU driver to the latest**](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html):
<!-- Intel® oneAPI Base Toolkit 2024.0 installation methods:
```eval_rst
.. tabs::
.. tab:: Offline installer
Download and install `Intel® oneAPI Base Toolkit <https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=window&distributions=offline>`_ version 2024.0 through Offline Installer.
During installation, you could just continue with "Recommended Installation". If you would like to continue with "Custom Installation", please note that oneAPI Deep Neural Network Library, oneAPI Math Kernel Library, and oneAPI DPC++/C++ Compiler are required, the other components are optional.
.. tab:: PIP installer
Pip install oneAPI in your working conda environment.
.. code-block:: bash
pip install dpcpp-cpp-rt==2024.0.2 mkl-dpcpp==2024.0.0 onednn==2024.0.0
.. note::
Activating your working conda environment will automatically configure oneAPI environment variables.
``` -->
### Install IPEX-LLM
#### Install IPEX-LLM From PyPI
We recommend using [Miniforge](https://conda-forge.org/download/) to create a Python 3.11 environment.
```eval_rst
.. important::
``ipex-llm`` is tested with Python 3.9, 3.10 and 3.11. Python 3.11 is recommended for best practices.
```
The easiest way to install `ipex-llm` is through the following commands, choosing either the US or CN website for `extra-index-url`:
```eval_rst
.. tabs::
.. tab:: US
.. code-block:: cmd
conda create -n llm python=3.11 libuv
conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. tab:: CN
.. code-block:: cmd
conda create -n llm python=3.11 libuv
conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
```
#### Install IPEX-LLM From Wheel
If you encounter network issues when installing IPEX, you can also install IPEX-LLM dependencies for Intel XPU from the wheel archives. First you need to download and install torch/torchvision/ipex from the wheels listed below before installing `ipex-llm`.
Download the wheels on Windows system:
```
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torch-2.1.0a0%2Bcxx11.abi-cp311-cp311-win_amd64.whl
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torchvision-0.16.0a0%2Bcxx11.abi-cp311-cp311-win_amd64.whl
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.1.10%2Bxpu-cp311-cp311-win_amd64.whl
```
You may install dependencies directly from the wheel archives and then install `ipex-llm` using following commands:
```
pip install torch-2.1.0a0+cxx11.abi-cp311-cp311-win_amd64.whl
pip install torchvision-0.16.0a0+cxx11.abi-cp311-cp311-win_amd64.whl
pip install intel_extension_for_pytorch-2.1.10+xpu-cp311-cp311-win_amd64.whl
pip install --pre --upgrade ipex-llm[xpu]
```
```eval_rst
.. note::
All the wheel packages mentioned here are for Python 3.11. If you would like to use Python 3.9 or 3.10, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp311`` with ``cp39`` or ``cp310``, respectively.
```
### Runtime Configuration
To use GPU acceleration on Windows, several environment variables are required before running a GPU example:
<!-- Make sure you are using CMD (Miniforge Prompt if using conda) as PowerShell is not supported, and configure oneAPI environment variables with:
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
Please also set the following environment variable if you would like to run LLMs on: -->
```eval_rst
.. tabs::
.. tab:: Intel iGPU
.. code-block:: cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
.. tab:: Intel Arc™ A-Series Graphics
.. code-block:: cmd
set SYCL_CACHE_PERSISTENT=1
```
```eval_rst
.. note::
For **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
```
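As an optional sanity check (not an official installation step), you could verify from Python that the XPU device is visible, for example:
```python
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend

# True if an Intel GPU is available to PyTorch through IPEX
print(torch.xpu.is_available())
print(torch.xpu.get_device_name(0))
```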
### Troubleshooting
#### 1. Error loading `intel_extension_for_pytorch`
If you encounter an error when importing `intel_extension_for_pytorch`, please ensure that you have completed the following steps:
* Ensure that you have installed Visual Studio with "Desktop development with C++" workload.
* Make sure that the correct version of oneAPI, specifically 2024.0, is installed.
* Ensure that `libuv` is installed in your conda environment. This can be done during the creation of the environment with the command:
```cmd
conda create -n llm python=3.11 libuv
```
If you missed `libuv`, you can add it to your existing environment through
```cmd
conda install libuv
```
<!-- * For oneAPI installed using the Offline installer, make sure you have configured oneAPI environment variables in your Miniforge Prompt through
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
Please note that you need to set these environment variables again once you have a new Miniforge Prompt window. -->
## Linux
### Prerequisites
IPEX-LLM GPU support on Linux has been verified on:
* Intel Arc™ A-Series Graphics
* Intel Data Center GPU Flex Series
* Intel Data Center GPU Max Series
```eval_rst
.. important::
IPEX-LLM on Linux supports PyTorch 2.0 and PyTorch 2.1.
.. warning::
IPEX-LLM support for Pytorch 2.0 is deprecated as of ``ipex-llm >= 2.1.0b20240511``.
```
```eval_rst
.. important::
We currently support the Ubuntu 20.04 operating system and later.
```
```eval_rst
.. tabs::
.. tab:: PyTorch 2.1
To enable IPEX-LLM for Intel GPUs with PyTorch 2.1, here are several prerequisite steps for tools installation and environment preparation:
* Step 1: Install Intel GPU Driver version >= stable_775_20_20231219. We highly recommend installing the latest version of intel-i915-dkms using apt.
.. seealso::
Please refer to our `driver installation <https://dgpu-docs.intel.com/driver/installation.html>`_ for general purpose GPU capabilities.
See `release page <https://dgpu-docs.intel.com/releases/index.html>`_ for latest version.
.. note::
For Intel Core™ Ultra integrated GPU, please make sure the level_zero version is >= 1.3.28717. The level_zero version can be checked with ``sycl-ls``, and the version will be tagged with ``[ext_oneapi_level_zero:gpu]``.
.. code-block::
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 5 125H OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [24.09.28717.12]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.28717]
If you have level_zero version < 1.3.28717, you could update as follows:
.. code-block:: bash
wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.16238.4/intel-igc-core_1.0.16238.4_amd64.deb
wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.16238.4/intel-igc-opencl_1.0.16238.4_amd64.deb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-level-zero-gpu-dbgsym_1.3.28717.12_amd64.ddeb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-level-zero-gpu_1.3.28717.12_amd64.deb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-opencl-icd-dbgsym_24.09.28717.12_amd64.ddeb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-opencl-icd_24.09.28717.12_amd64.deb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/libigdgmm12_22.3.17_amd64.deb
sudo dpkg -i *.deb
* Step 2: Download and install `Intel® oneAPI Base Toolkit <https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html>`_ with version 2024.0. OneDNN, OneMKL and DPC++ compiler are needed, others are optional.
Intel® oneAPI Base Toolkit 2024.0 installation methods:
.. tabs::
.. tab:: APT installer
Step 1: Set up repository
.. code-block:: bash
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
Step 2: Install the package
.. code-block:: bash
sudo apt install intel-oneapi-common-vars=2024.0.0-49406 \
intel-oneapi-common-oneapi-vars=2024.0.0-49406 \
intel-oneapi-diagnostics-utility=2024.0.0-49093 \
intel-oneapi-compiler-dpcpp-cpp=2024.0.2-49895 \
intel-oneapi-dpcpp-ct=2024.0.0-49381 \
intel-oneapi-mkl=2024.0.0-49656 \
intel-oneapi-mkl-devel=2024.0.0-49656 \
intel-oneapi-mpi=2021.11.0-49493 \
intel-oneapi-mpi-devel=2021.11.0-49493 \
intel-oneapi-dal=2024.0.1-25 \
intel-oneapi-dal-devel=2024.0.1-25 \
intel-oneapi-ippcp=2021.9.1-5 \
intel-oneapi-ippcp-devel=2021.9.1-5 \
intel-oneapi-ipp=2021.10.1-13 \
intel-oneapi-ipp-devel=2021.10.1-13 \
intel-oneapi-tlt=2024.0.0-352 \
intel-oneapi-ccl=2021.11.2-5 \
intel-oneapi-ccl-devel=2021.11.2-5 \
intel-oneapi-dnnl-devel=2024.0.0-49521 \
intel-oneapi-dnnl=2024.0.0-49521 \
intel-oneapi-tcm-1.0=1.0.0-435
.. note::
You can uninstall the package by running the following command:
.. code-block:: bash
sudo apt autoremove intel-oneapi-common-vars
.. tab:: PIP installer
Step 1: Install oneAPI in a user-defined folder, e.g., ``~/intel/oneapi``.
.. code-block:: bash
export PYTHONUSERBASE=~/intel/oneapi
pip install dpcpp-cpp-rt==2024.0.2 mkl-dpcpp==2024.0.0 onednn==2024.0.0 --user
.. note::
The oneAPI packages are visible in ``pip list`` only if ``PYTHONUSERBASE`` is properly set.
Step 2: Configure your working conda environment (e.g. with name ``llm``) to append oneAPI path (e.g. ``~/intel/oneapi/lib``) to the environment variable ``LD_LIBRARY_PATH``.
.. code-block:: bash
conda env config vars set LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/intel/oneapi/lib -n llm
.. note::
You can view the configured environment variables for your environment (e.g. with name ``llm``) by running ``conda env config vars list -n llm``.
You can continue with your working conda environment and install ``ipex-llm`` as guided in the next section.
.. note::
You are recommended not to install other pip packages in the user-defined folder for oneAPI (e.g. ``~/intel/oneapi``).
You can uninstall the oneAPI package by simply deleting the package folder, and unsetting the configuration of your working conda environment (e.g., with name ``llm``).
.. code-block:: bash
rm -r ~/intel/oneapi
conda env config vars unset LD_LIBRARY_PATH -n llm
.. tab:: Offline installer
Using the offline installer allows you to customize the installation path.
.. code-block:: bash
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/20f4e6a1-6b0b-4752-b8c1-e5eacba10e01/l_BaseKit_p_2024.0.0.49564_offline.sh
sudo sh ./l_BaseKit_p_2024.0.0.49564_offline.sh
.. note::
You can also modify the installation or uninstall the package by running the following commands:
.. code-block:: bash
cd /opt/intel/oneapi/installer
sudo ./installer
.. tab:: PyTorch 2.0 (deprecated for versions ``ipex-llm >= 2.1.0b20240511``)
To enable IPEX-LLM for Intel GPUs with PyTorch 2.0, here are several prerequisite steps for tools installation and environment preparation:
* Step 1: Install Intel GPU Driver version >= stable_775_20_20231219. We highly recommend installing the latest version of intel-i915-dkms using apt.
.. seealso::
Please refer to our `driver installation <https://dgpu-docs.intel.com/driver/installation.html>`_ for general purpose GPU capabilities.
See `release page <https://dgpu-docs.intel.com/releases/index.html>`_ for latest version.
* Step 2: Download and install `Intel® oneAPI Base Toolkit <https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html>`_ with version 2023.2. OneDNN, OneMKL and DPC++ compiler are needed, others are optional.
Intel® oneAPI Base Toolkit 2023.2 installation methods:
.. tabs::
.. tab:: APT installer
Step 1: Set up repository
.. code-block:: bash
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
Step 2: Install the packages
.. code-block:: bash
sudo apt install -y intel-oneapi-common-vars=2023.2.0-49462 \
intel-oneapi-compiler-cpp-eclipse-cfg=2023.2.0-49495 intel-oneapi-compiler-dpcpp-eclipse-cfg=2023.2.0-49495 \
intel-oneapi-diagnostics-utility=2022.4.0-49091 \
intel-oneapi-compiler-dpcpp-cpp=2023.2.0-49495 \
intel-oneapi-mkl=2023.2.0-49495 intel-oneapi-mkl-devel=2023.2.0-49495 \
intel-oneapi-mpi=2021.10.0-49371 intel-oneapi-mpi-devel=2021.10.0-49371 \
intel-oneapi-tbb=2021.10.0-49541 intel-oneapi-tbb-devel=2021.10.0-49541\
intel-oneapi-ccl=2021.10.0-49084 intel-oneapi-ccl-devel=2021.10.0-49084\
intel-oneapi-dnnl-devel=2023.2.0-49516 intel-oneapi-dnnl=2023.2.0-49516
.. note::
You can uninstall the package by running the following command:
.. code-block:: bash
sudo apt autoremove intel-oneapi-common-vars
.. tab:: PIP installer
Step 1: Install oneAPI in a user-defined folder, e.g., ``~/intel/oneapi``
.. code-block:: bash
export PYTHONUSERBASE=~/intel/oneapi
pip install dpcpp-cpp-rt==2023.2.0 mkl-dpcpp==2023.2.0 onednn-cpu-dpcpp-gpu-dpcpp==2023.2.0 --user
.. note::
The oneAPI packages are visible in ``pip list`` only if ``PYTHONUSERBASE`` is properly set.
Step 2: Configure your working conda environment (e.g. with name ``llm``) to append oneAPI path (e.g. ``~/intel/oneapi/lib``) to the environment variable ``LD_LIBRARY_PATH``.
.. code-block:: bash
conda env config vars set LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/intel/oneapi/lib -n llm
.. note::
You can view the configured environment variables for your environment (e.g. with name ``llm``) by running ``conda env config vars list -n llm``.
You can continue with your working conda environment and install ``ipex-llm`` as guided in the next section.
.. note::
You are recommended not to install other pip packages in the user-defined folder for oneAPI (e.g. ``~/intel/oneapi``).
You can uninstall the oneAPI package by simply deleting the package folder, and unsetting the configuration of your working conda environment (e.g., with name ``llm``).
.. code-block:: bash
rm -r ~/intel/oneapi
conda env config vars unset LD_LIBRARY_PATH -n llm
.. tab:: Offline installer
Using the offline installer allows you to customize the installation path.
.. code-block:: bash
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/992857b9-624c-45de-9701-f6445d845359/l_BaseKit_p_2023.2.0.49397_offline.sh
sudo sh ./l_BaseKit_p_2023.2.0.49397_offline.sh
.. note::
You can also modify the installation or uninstall the package by running the following commands:
.. code-block:: bash
cd /opt/intel/oneapi/installer
sudo ./installer
```
### Install IPEX-LLM
#### Install IPEX-LLM From PyPI
We recommend using [Miniforge](https://conda-forge.org/download/) to create a Python 3.11 environment:
```eval_rst
.. important::
``ipex-llm`` is tested with Python 3.9, 3.10 and 3.11. Python 3.11 is recommended for best practices.
```
```eval_rst
.. important::
Make sure you install matching versions of ipex-llm/pytorch/IPEX and oneAPI Base Toolkit. IPEX-LLM with PyTorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. IPEX-LLM with PyTorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
```
```eval_rst
.. tabs::
.. tab:: PyTorch 2.1
Choose either US or CN website for ``extra-index-url``:
.. tabs::
.. tab:: US
.. code-block:: bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. note::
The ``xpu`` option will install IPEX-LLM with PyTorch 2.1 by default, which is equivalent to
.. code-block:: bash
pip install --pre --upgrade ipex-llm[xpu_2.1] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. tab:: CN
.. code-block:: bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
.. note::
The ``xpu`` option will install IPEX-LLM with PyTorch 2.1 by default, which is equivalent to
.. code-block:: bash
pip install --pre --upgrade ipex-llm[xpu_2.1] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
.. tab:: PyTorch 2.0 (deprecated for versions ``ipex-llm >= 2.1.0b20240511``)
Choose either US or CN website for ``extra-index-url``:
.. tabs::
.. tab:: US
.. code-block:: bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[xpu_2.0]==2.1.0b20240510 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. tab:: CN
.. code-block:: bash
conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[xpu_2.0]==2.1.0b20240510 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
```
#### Install IPEX-LLM From Wheel
If you encounter network issues when installing IPEX, you can also install IPEX-LLM dependencies for Intel XPU from the wheel archives. First you need to download and install torch/torchvision/ipex from the wheels listed below before installing `ipex-llm`.
```eval_rst
.. tabs::
.. tab:: PyTorch 2.1
.. code-block:: bash
# get the wheels on Linux system for IPEX 2.1.10+xpu
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torch-2.1.0a0%2Bcxx11.abi-cp311-cp311-linux_x86_64.whl
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torchvision-0.16.0a0%2Bcxx11.abi-cp311-cp311-linux_x86_64.whl
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.1.10%2Bxpu-cp311-cp311-linux_x86_64.whl
Then you may install directly from the wheel archives using following commands:
.. code-block:: bash
# install the packages from the wheels
pip install torch-2.1.0a0+cxx11.abi-cp311-cp311-linux_x86_64.whl
pip install torchvision-0.16.0a0+cxx11.abi-cp311-cp311-linux_x86_64.whl
pip install intel_extension_for_pytorch-2.1.10+xpu-cp311-cp311-linux_x86_64.whl
# install ipex-llm for Intel GPU
pip install --pre --upgrade ipex-llm[xpu]
.. tab:: PyTorch 2.0 (deprecated for versions ``ipex-llm >= 2.1.0b20240511``)
.. code-block:: bash
# get the wheels on Linux system for IPEX 2.0.110+xpu
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torch-2.0.1a0%2Bcxx11.abi-cp311-cp311-linux_x86_64.whl
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torchvision-0.15.2a0%2Bcxx11.abi-cp311-cp311-linux_x86_64.whl
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.0.110%2Bxpu-cp311-cp311-linux_x86_64.whl
Then you may install directly from the wheel archives using following commands:
.. code-block:: bash
# install the packages from the wheels
pip install torch-2.0.1a0+cxx11.abi-cp311-cp311-linux_x86_64.whl
pip install torchvision-0.15.2a0+cxx11.abi-cp311-cp311-linux_x86_64.whl
pip install intel_extension_for_pytorch-2.0.110+xpu-cp311-cp311-linux_x86_64.whl
# install ipex-llm for Intel GPU
pip install --pre --upgrade ipex-llm[xpu_2.0]==2.1.0b20240510
```
```eval_rst
.. note::
All the wheel packages mentioned here are for Python 3.11. If you would like to use Python 3.9 or 3.10, you should modify the wheel names for ``torch``, ``torchvision``, and ``intel_extension_for_pytorch`` by replacing ``cp311`` with ``cp39`` or ``cp310``, respectively.
```
### Runtime Configuration
To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example.
```eval_rst
.. tabs::
.. tab:: Intel Arc™ A-Series and Intel Data Center GPU Flex
For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend:
.. code-block:: bash
# Configure oneAPI environment variables. Required step for APT or offline installed oneAPI.
# Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH.
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables for optimal performance
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
.. tab:: Intel Data Center GPU Max
For Intel Data Center GPU Max Series, we recommend:
.. code-block:: bash
# Configure oneAPI environment variables. Required step for APT or offline installed oneAPI.
# Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH.
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables for optimal performance
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
Please note that ``libtcmalloc.so`` can be installed by ``conda install -c conda-forge -y gperftools=2.10``
.. tab:: Intel iGPU
.. code-block:: bash
# Configure oneAPI environment variables. Required step for APT or offline installed oneAPI.
# Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH.
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1
```
```eval_rst
.. note::
For **the first time** that **each model** runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
```
### Known issues
#### 1. Potential suboptimal performance with Linux kernel 6.2.0
For Ubuntu 22.04 and driver version < stable_775_20_20231219, the performance on Linux kernel 6.2.0 is worse than on Linux kernel 5.19.0. You can use `sudo apt update && sudo apt install -y intel-i915-dkms intel-fw-gpu` to install the latest driver to solve this issue (a reboot of the OS is required).
Tip: You can use `sudo apt list --installed | grep intel-i915-dkms` to check your intel-i915-dkms version; it should be the latest and >= `1.23.9.11.231003.15+i19-1`.
#### 2. Driver installation unmet dependencies error: intel-i915-dkms
The last apt install command of the driver installation may produce the following error:
```
The following packages have unmet dependencies:
intel-i915-dkms : Conflicts: intel-platform-cse-dkms
Conflicts: intel-platform-vsec-dkms
```
You can use `sudo apt install -y intel-i915-dkms intel-fw-gpu` to install instead, as intel-platform-cse-dkms and intel-platform-vsec-dkms are already provided by intel-i915-dkms.
### Troubleshooting
#### 1. Cannot open shared object file: No such file or directory
You may see errors where a libmkl file cannot be found, for example:
```
OSError: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory
```
```
Error: libmkl_sycl_blas.so.4: cannot open shared object file: No such file or directory
```
The reason for such errors is that oneAPI has not been initialized properly before running IPEX-LLM code or before importing the IPEX package. A quick sanity check is sketched after the list below.
* For oneAPI installed using APT or Offline Installer, make sure you execute `setvars.sh` of oneAPI Base Toolkit before running IPEX-LLM.
* For PIP-installed oneAPI, activate your working environment and run ``echo $LD_LIBRARY_PATH`` to check if the installation path is properly configured for the environment. If the output does not contain oneAPI path (e.g. ``~/intel/oneapi/lib``), check [Prerequisites](#id1) to re-install oneAPI with PIP installer.
* Make sure you install matching versions of ipex-llm/pytorch/IPEX and oneAPI Base Toolkit. IPEX-LLM with PyTorch 2.1 should be used with oneAPI Base Toolkit version 2024.0. IPEX-LLM with PyTorch 2.0 should be used with oneAPI Base Toolkit version 2023.2.
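As that quick sanity check (an illustration, not an official step), you could verify from Python that the oneAPI libraries are on the library path and that IPEX imports cleanly:
```python
import os

# the oneAPI library path (e.g. ~/intel/oneapi/lib for PIP-installed oneAPI) should appear here
print([p for p in os.environ.get("LD_LIBRARY_PATH", "").split(":") if "oneapi" in p.lower()])

# importing IPEX fails with the libmkl errors above if oneAPI is not initialized properly
import torch
import intel_extension_for_pytorch as ipex
print(torch.__version__, ipex.__version__)
```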

View file

@ -0,0 +1 @@
# IPEX-LLM Known Issues

View file

@ -0,0 +1,68 @@
# IPEX-LLM in 5 minutes
You can use IPEX-LLM to run any [*Hugging Face Transformers*](https://huggingface.co/docs/transformers/index) PyTorch model. It automatically optimizes and accelerates LLMs using low-precision (INT4/INT5/INT8) techniques, modern hardware accelerations and the latest software optimizations.
Hugging Face transformers-based applications can run on IPEX-LLM with a one-line code change, and you'll immediately observe significant speedup<sup><a href="#footnote-perf" id="ref-perf">[1]</a></sup>.
Here, let's take a relatively small LLM, i.e. [open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2), with IPEX-LLM INT4 optimizations as an example.
## Load a Pretrained Model
Simply use one-line `transformers`-style API in `ipex-llm` to load `open_llama_3b_v2` with INT4 optimization (by specifying `load_in_4bit=True`) as follows:
```python
from ipex_llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b_v2",
load_in_4bit=True)
```
```eval_rst
.. tip::
`open_llama_3b_v2 <https://huggingface.co/openlm-research/open_llama_3b_v2>`_ is a pretrained large language model hosted on Hugging Face. ``openlm-research/open_llama_3b_v2`` is its Hugging Face model id. ``from_pretrained`` will automatically download the model from Hugging Face to a local cache path (e.g. ``~/.cache/huggingface``), load the model, and convert it to ``ipex-llm`` INT4 format.
It may take a long time to download the model using the API. You can also download the model yourself, and set ``pretrained_model_name_or_path`` to the local path of the downloaded model. This way, ``from_pretrained`` will load and convert directly from the local path without downloading.
```
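A hedged sketch of downloading the model in advance, assuming a recent `huggingface_hub` release that provides the `huggingface-cli download` command:
```bash
# Download open_llama_3b_v2 to a local folder, then point
# pretrained_model_name_or_path to ./open_llama_3b_v2
pip install -U huggingface_hub
huggingface-cli download openlm-research/open_llama_3b_v2 --local-dir ./open_llama_3b_v2
```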
## Load Tokenizer
You also need a tokenizer for inference. Just use the official `transformers` API to load `LlamaTokenizer`:
```python
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b_v2")
```
## Run LLM
Now you can do model inference exactly the same way as with the official `transformers` API:
```python
import torch
with torch.inference_mode():
prompt = 'Q: What is CPU?\nA:'
# tokenize the input prompt from string to token ids
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# predict the next tokens (maximum 32) based on the input token ids
output = model.generate(input_ids,
max_new_tokens=32)
# decode the predicted token ids to output string
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_str)
```
------
<div>
<p>
<sup><a href="#ref-perf" id="footnote-perf">[1]</a>
Performance varies by use, configuration and other factors. <code><span>ipex-llm</span></code> may not optimize to the same degree for non-Intel products. Learn more at <a href="https://www.Intel.com/PerformanceIndex">www.Intel.com/PerformanceIndex</a>.
</sup>
</p>
</div>

View file

@ -0,0 +1,314 @@
# Finetune LLM with Axolotl on Intel GPU
[Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is a popular tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures. You can now use [`ipex-llm`](https://github.com/intel-analytics/ipex-llm) as an accelerated backend for `Axolotl` running on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*.
See the demo of finetuning LLaMA2-7B on Intel Arc GPU below.
<video src="https://llm-assets.readthedocs.io/en/latest/_images/axolotl-qlora-linux-arc.mp4" width="100%" controls></video>
## Quickstart
### 0. Prerequisites
IPEX-LLM's support for [Axolotl v0.4.0](https://github.com/OpenAccess-AI-Collective/axolotl/tree/v0.4.0) is only available on Linux systems. We recommend Ubuntu 20.04 or later (Ubuntu 22.04 is preferred).
Visit [Install IPEX-LLM on Linux with Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html), and follow [Install Intel GPU Driver](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-intel-gpu-driver) and [Install oneAPI](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-oneapi) to install the GPU driver and Intel® oneAPI Base Toolkit 2024.0.
### 1. Install IPEX-LLM for Axolotl
Create a new conda env, and install `ipex-llm[xpu]`.
```cmd
conda create -n axolotl python=3.11
conda activate axolotl
# install ipex-llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
Install [axolotl v0.4.0](https://github.com/OpenAccess-AI-Collective/axolotl/tree/v0.4.0) from git.
```cmd
# install axolotl v0.4.0
git clone -b v0.4.0 https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl
# replace requirements.txt
rm requirements.txt
wget -O requirements.txt https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/requirements-xpu.txt
pip install -e .
pip install transformers==4.36.0
# to avoid https://github.com/OpenAccess-AI-Collective/axolotl/issues/1544
pip install datasets==2.15.0
# prepare axolotl entrypoints
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/finetune.py
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/train.py
```
**After the installation, you should have created a conda environment, named `axolotl` for instance, for running `Axolotl` commands with IPEX-LLM.**
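As an optional sanity check, the following is a minimal sketch (assuming the installation above succeeded, and that oneAPI variables are configured first if you installed oneAPI via APT or the offline installer, see "Set Environment Variables" below) to verify that the packages resolve in the new environment:
```bash
# Verify that PyTorch and ipex-llm can be imported in the axolotl environment
python -c "import torch; import ipex_llm; print(torch.__version__)"
```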
### 2. Example: Finetune Llama-2-7B with Axolotl
The following example will introduce finetuning [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) with [alpaca_2k_test](https://huggingface.co/datasets/mhenrichsen/alpaca_2k_test) dataset using LoRA and QLoRA.
Note that you don't need to write any code in this example.
| Model | Dataset | Finetune method |
|-------|-------|-------|
| Llama-2-7B | alpaca_2k_test | LoRA (Low-Rank Adaptation) |
| Llama-2-7B | alpaca_2k_test | QLoRA (Quantized Low-Rank Adaptation) |
For more technical details, please refer to [Llama 2](https://arxiv.org/abs/2307.09288), [LoRA](https://arxiv.org/abs/2106.09685) and [QLoRA](https://arxiv.org/abs/2305.14314).
#### 2.1 Download Llama-2-7B and alpaca_2k_test
By default, Axolotl automatically downloads models and datasets from Hugging Face. Please make sure you have logged in to Hugging Face.
```cmd
huggingface-cli login
```
If you prefer offline models and datasets, please download [Llama-2-7B](https://huggingface.co/meta-llama/Llama-2-7b) and [alpaca_2k_test](https://huggingface.co/datasets/mhenrichsen/alpaca_2k_test) in advance (a download sketch is shown after the command below). Then, set `HF_HUB_OFFLINE=1` to avoid connecting to Hugging Face.
```cmd
export HF_HUB_OFFLINE=1
```
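A hedged sketch of pre-downloading the model and dataset used in the examples below, assuming a recent `huggingface_hub` that provides `huggingface-cli download` (the local paths are placeholders; point `base_model` and the dataset `path` in the yml files to wherever you store them):
```bash
# Download the non-gated Llama-2-7B checkpoint referenced in lora.yml/qlora.yml
huggingface-cli download NousResearch/Llama-2-7b-hf --local-dir /path/to/model/Llama-2-7b-hf
# Download the alpaca_2k_test dataset
huggingface-cli download --repo-type dataset mhenrichsen/alpaca_2k_test --local-dir /path/to/dataset/alpaca_2k_test
```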
#### 2.2 Set Environment Variables
```eval_rst
.. note::
This step is required for oneAPI installed via APT or the offline installer. Skip it for PIP-installed oneAPI.
```
Configure oneAPI variables by running the following command:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
source /opt/intel/oneapi/setvars.sh
```
Configure accelerate to avoid training with CPU. You can download a default `default_config.yaml` with `use_cpu: false`.
```cmd
mkdir -p ~/.cache/huggingface/accelerate/
wget -O ~/.cache/huggingface/accelerate/default_config.yaml https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/default_config.yaml
```
Alternatively, you can configure accelerate based on your requirements.
```cmd
accelerate config
```
Please answer `NO` in option `Do you want to run your training on CPU only (even if a GPU / Apple Silicon device is available)? [yes/NO]:`.
After finishing the accelerate config, check that `use_cpu` is disabled (i.e., `use_cpu: false`) in the accelerate config file (`~/.cache/huggingface/accelerate/default_config.yaml`).
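A minimal check, assuming the default accelerate config location:
```bash
# Should print a line containing "use_cpu: false"
grep use_cpu ~/.cache/huggingface/accelerate/default_config.yaml
```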
#### 2.3 LoRA finetune
Prepare `lora.yml` for Axolotl LoRA finetune. You can download a template from GitHub.
```cmd
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/lora.yml
```
**If you are using an offline model and dataset in a local environment**, modify the model path and dataset path in `lora.yml`; otherwise, keep them unchanged.
```yaml
# Please change to local path if model is offline, e.g., /path/to/model/Llama-2-7b-hf
base_model: NousResearch/Llama-2-7b-hf
datasets:
# Please change to local path if dataset is offline, e.g., /path/to/dataset/alpaca_2k_test
- path: mhenrichsen/alpaca_2k_test
type: alpaca
```
Modify LoRA parameters, such as `lora_r` and `lora_alpha`, etc.
```yaml
adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
```
Launch LoRA training with the following command.
```cmd
accelerate launch finetune.py lora.yml
```
In Axolotl v0.4.0, you can use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`.
```cmd
accelerate launch train.py lora.yml
```
#### 2.4 QLoRA finetune
Prepare `qlora.yml` for QLoRA finetune. You can download a template from GitHub.
```cmd
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/qlora.yml
```
**If you are using an offline model and dataset in a local environment**, modify the model path and dataset path in `qlora.yml`; otherwise, keep them unchanged.
```yaml
# Please change to local path if model is offline, e.g., /path/to/model/Llama-2-7b-hf
base_model: NousResearch/Llama-2-7b-hf
datasets:
# Please change to local path if dataset is offline, e.g., /path/to/dataset/alpaca_2k_test
- path: mhenrichsen/alpaca_2k_test
type: alpaca
```
Modify QLoRA parameters, such as `lora_r` and `lora_alpha`, etc.
```yaml
adapter: qlora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
```
Launch QLoRA training with the following command.
```cmd
accelerate launch finetune.py qlora.yml
```
In Axolotl v0.4.0, you can use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`.
```cmd
accelerate launch train.py qlora.yml
```
### 3. Finetune Llama-3-8B (Experimental)
Warning: this section will install axolotl main ([796a085](https://github.com/OpenAccess-AI-Collective/axolotl/tree/796a085b2f688f4a5efe249d95f53ff6833bf009)) for new features, e.g., Llama-3-8B.
#### 3.1 Install Axolotl main in conda
Axolotl main has many new dependencies. Please set up a new conda env for this version.
```cmd
conda create -n llm python=3.11
conda activate llm
# install axolotl main
git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl && git checkout 796a085
pip install -e .
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
# install transformers etc
# to avoid https://github.com/OpenAccess-AI-Collective/axolotl/issues/1544
pip install datasets==2.15.0
pip install transformers==4.37.0
```
Configure accelerate and oneAPI according to [Set Environment Variables](#22-set-environment-variables).
#### 3.2 Alpaca QLoRA
This example is based on the [axolotl Llama-3 QLoRA example](https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/examples/llama-3/qlora.yml).
Prepare `llama3-qlora.yml` for QLoRA finetune. You can download a template from GitHub.
```cmd
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/LLM-Finetuning/axolotl/llama3-qlora.yml
```
**If you are using an offline model and dataset in a local environment**, modify the model path and dataset path in `llama3-qlora.yml`; otherwise, keep them unchanged.
```yaml
# Please change to local path if model is offline, e.g., /path/to/model/Meta-Llama-3-8B
base_model: meta-llama/Meta-Llama-3-8B
datasets:
# Please change to local path if dataset is offline, e.g., /path/to/dataset/alpaca_2k_test
- path: aaditya/alpaca_subset_1
type: alpaca
```
Modify QLoRA parameters, such as `lora_r` and `lora_alpha`, etc.
```yaml
adapter: qlora
lora_model_dir:
sequence_len: 256
sample_packing: true
pad_to_sequence_len: true
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
```
```cmd
accelerate launch finetune.py llama3-qlora.yml
```
You can also use `train.py` instead of `-m axolotl.cli.train` or `finetune.py`.
```cmd
accelerate launch train.py llama3-qlora.yml
```
Expected output
```cmd
{'loss': 0.237, 'learning_rate': 1.2254711850265387e-06, 'epoch': 3.77}
{'loss': 0.6068, 'learning_rate': 1.1692453482951115e-06, 'epoch': 3.77}
{'loss': 0.2926, 'learning_rate': 1.1143322458989303e-06, 'epoch': 3.78}
{'loss': 0.2475, 'learning_rate': 1.0607326072295087e-06, 'epoch': 3.78}
{'loss': 0.1531, 'learning_rate': 1.008447144232094e-06, 'epoch': 3.79}
{'loss': 0.1799, 'learning_rate': 9.57476551396197e-07, 'epoch': 3.79}
{'loss': 0.2724, 'learning_rate': 9.078215057463868e-07, 'epoch': 3.79}
{'loss': 0.2534, 'learning_rate': 8.594826668332445e-07, 'epoch': 3.8}
{'loss': 0.3388, 'learning_rate': 8.124606767246579e-07, 'epoch': 3.8}
{'loss': 0.3867, 'learning_rate': 7.667561599972505e-07, 'epoch': 3.81}
{'loss': 0.2108, 'learning_rate': 7.223697237281668e-07, 'epoch': 3.81}
{'loss': 0.0792, 'learning_rate': 6.793019574868775e-07, 'epoch': 3.82}
```
## Troubleshooting
#### TypeError: PosixPath
Error message: `TypeError: argument of type 'PosixPath' is not iterable`
This issue is related to [axolotl #1544](https://github.com/OpenAccess-AI-Collective/axolotl/issues/1544). It can be fixed by downgrading datasets to 2.15.0.
```cmd
pip install datasets==2.15.0
```
#### RuntimeError: out of device memory
Error message: `RuntimeError: Allocation is out of device memory on current platform.`
This issue is caused by running out of GPU memory. Please reduce `lora_r` or `micro_batch_size` in `qlora.yml` or `lora.yml`, or reduce the amount of training data.
#### OSError: libmkl_intel_lp64.so.2
Error message: `OSError: libmkl_intel_lp64.so.2: cannot open shared object file: No such file or directory`
The oneAPI environment is not correctly set. Please refer to [Set Environment Variables](#22-set-environment-variables).

View file

@ -0,0 +1,174 @@
# Run Performance Benchmarking with IPEX-LLM
We can perform benchmarking for IPEX-LLM on Intel CPUs and GPUs using the benchmark scripts we provide.
## Prepare The Environment
You can refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install.html) to install IPEX-LLM in your environment. The following dependencies are also needed to run the benchmark scripts.
```
pip install pandas
pip install omegaconf
```
## Prepare The Scripts
Navigate to your local workspace and then download IPEX-LLM from GitHub. Modify `config.yaml` under the `all-in-one` folder for your benchmark configurations.
```
cd your/local/workspace
git clone https://github.com/intel-analytics/ipex-llm.git
cd ipex-llm/python/llm/dev/benchmark/all-in-one/
```
## config.yaml
```yaml
repo_id:
- 'meta-llama/Llama-2-7b-chat-hf'
local_model_hub: 'path to your local model hub'
warm_up: 1 # must be set >= 2 when running the "pipeline_parallel_gpu" test_api
num_trials: 3
num_beams: 1 # default to greedy search
low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
batch_size: 1 # default to 1
in_out_pairs:
- '32-32'
- '1024-128'
- '2048-256'
test_api:
- "transformer_int4_gpu" # on Intel GPU, transformer-like API, (qtype=int4)
cpu_embedding: False # whether put embedding to CPU
streaming: False # whether to output in a streaming way (currently only available for Windows GPU-related test_api)
task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
```
Some parameters in the yaml file that you can configure:
- `repo_id`: The name of the model and its organization.
- `local_model_hub`: The folder path where the models are stored on your machine. Replace 'path to your local model hub' with the actual path, e.g., `/llm/models`.
- `warm_up`: The number of warmup trials before performance benchmarking (must be set to >= 2 when using the "pipeline_parallel_gpu" test_api).
- `num_trials`: The number of runs for performance benchmarking (the final result is the average of all trials).
- `low_bit`: The low_bit precision you want to convert to for benchmarking.
- `batch_size`: The number of samples on which the models make predictions in one forward pass.
- `in_out_pairs`: Input sequence length and output sequence length combined by '-'.
- `test_api`: Different test functions for different machines.
- `transformer_int4_gpu` on Intel GPU for Linux
- `transformer_int4_gpu_win` on Intel GPU for Windows
- `transformer_int4` on Intel CPU
- `cpu_embedding`: Whether to put embedding on CPU (only available for Windows GPU-related test_api).
- `streaming`: Whether to output in a streaming way (only available for Windows GPU-related test_api).
- `use_fp16_torch_dtype`: Whether to use fp16 for the non-linear layer (only available for the "pipeline_parallel_gpu" test_api).
- `n_gpu`: Number of GPUs to use (only available for the "pipeline_parallel_gpu" test_api).
- `task`: There are three tasks: `continuation`, `QA` and `summarize`. `continuation` refers to writing additional content based on the prompt, `QA` refers to answering questions based on the prompt, and `summarize` refers to summarizing the prompt.
```eval_rst
.. note::
If you want to benchmark the performance without warmup, you can set ``warm_up: 0`` and ``num_trials: 1`` in ``config.yaml``, and run each single model and in_out_pair separately.
```
## Run on Windows
Please refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration) to configure oneAPI environment variables.
```eval_rst
.. tabs::
.. tab:: Intel iGPU
.. code-block:: bash
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
python run.py
.. tab:: Intel Arc™ A300-Series or Pro A60
.. code-block:: bash
set SYCL_CACHE_PERSISTENT=1
python run.py
.. tab:: Other Intel dGPU Series
.. code-block:: bash
# e.g. Arc™ A770
python run.py
```
## Run on Linux
```eval_rst
.. tabs::
.. tab:: Intel Arc™ A-Series and Intel Data Center GPU Flex
For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend:
.. code-block:: bash
./run-arc.sh
.. tab:: Intel iGPU
For Intel iGPU, we recommend:
.. code-block:: bash
./run-igpu.sh
.. tab:: Intel Data Center GPU Max
Please note that you need to run ``conda install -c conda-forge -y gperftools=2.10`` before running the benchmark script on Intel Data Center GPU Max Series.
.. code-block:: bash
./run-max-gpu.sh
.. tab:: Intel SPR
For Intel SPR machine, we recommend:
.. code-block:: bash
./run-spr.sh
The script uses a default numactl strategy. If you want to customize it, please use ``lscpu`` or ``numactl -H`` to check how CPU indexes are assigned to NUMA nodes, and make sure the run command is bound to only one socket.
.. tab:: Intel HBM
For Intel HBM machine, we recommend:
.. code-block:: bash
./run-hbm.sh
The script uses a default numactl strategy. If you want to customize it, please use ``numactl -H`` to check how the HBM nodes and CPUs are assigned.
For example:
.. code-block:: bash
node 0 1 2 3
0: 10 21 13 23
1: 21 10 23 13
2: 13 23 10 23
3: 23 13 23 10
Here, an HBM node is the node whose distance from the checked node is 13; for example, node 2 is node 0's HBM node.
Make sure the run command is bound to only one socket.
```
## Result
After the benchmarking completes, a CSV result file is generated under the current folder. Focus mainly on the columns `1st token avg latency (ms)` and `2+ avg latency (ms/token)` for the benchmark results. Also check whether the column `actual input/output tokens` is consistent with the column `input/output tokens`, and whether the parameters you specified in `config.yaml` have been successfully applied in the benchmarking.
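If you want a quick look at the results in the terminal, the following is a minimal sketch; it only assumes that the CSV result file is written to the current folder as described above:
```bash
# Pretty-print the most recent CSV result file in the current folder
column -s, -t < "$(ls -t *.csv | head -n 1)" | less -S
```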

View file

@ -0,0 +1,63 @@
# `bigdl-llm` Migration Guide
This guide helps you migrate your `bigdl-llm` application to use `ipex-llm`.
## Upgrade `bigdl-llm` package to `ipex-llm`
```eval_rst
.. note::
This step assumes you have already installed `bigdl-llm`.
```
You need to uninstall `bigdl-llm` and install `ipex-llm`. With your `bigdl-llm` conda environment activated, execute the following command according to your device type and location:
### For CPU
```bash
pip uninstall -y bigdl-llm
pip install --pre --upgrade ipex-llm[all] # for cpu
```
### For GPU
Choose either US or CN website for `extra-index-url`:
```eval_rst
.. tabs::
.. tab:: US
.. code-block:: cmd
pip uninstall -y bigdl-llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. tab:: CN
.. code-block:: cmd
pip uninstall -y bigdl-llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
```
## Migrate `bigdl-llm` code to `ipex-llm`
There are two options to migrate `bigdl-llm` code to `ipex-llm`.
### 1. Upgrade `bigdl-llm` code to `ipex-llm`
To upgrade `bigdl-llm` code to `ipex-llm`, simply replace all `bigdl.llm` with `ipex_llm`:
```python
#from bigdl.llm.transformers import AutoModelForCausalLM # Original line
from ipex_llm.transformers import AutoModelForCausalLM #Updated line
model = AutoModelForCausalLM.from_pretrained(model_path,
load_in_4bit=True,
trust_remote_code=True)
```
### 2. Run `bigdl-llm` code in compatible mode (experimental)
To run in the compatible mode, simply add `import ipex_llm` at the beginning of the existing `bigdl-llm` code:
```python
import ipex_llm # Add this line before any bigdl.llm imports
from bigdl.llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_path,
load_in_4bit=True,
trust_remote_code=True)
```

View file

@ -0,0 +1,82 @@
# Run Local RAG using Langchain-Chatchat on Intel CPU and GPU
[chatchat-space/Langchain-Chatchat](https://github.com/chatchat-space/Langchain-Chatchat) is a Knowledge Base QA application using RAG pipeline; by porting it to [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily run ***local RAG pipelines*** using [Langchain-Chatchat](https://github.com/intel-analytics/Langchain-Chatchat) with LLMs and Embedding models on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max).
*See the demos of running LLaMA2-7B (English) and ChatGLM-3-6B (Chinese) on an Intel Core Ultra laptop below.*
<table border="1" width="100%">
<tr>
<td align="center" width="50%">English</td>
<td align="center" width="50%">简体中文</td>
</tr>
<tr>
<td><video src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-en.mp4" width="100%" controls></video></td>
<td><video src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-chatchat-cn.mp4" width="100%" controls></video></td>
</tr>
</table>
>You can change the UI language in the left-side menu. We currently support **English** and **简体中文** (see the video demos above).
## Langchain-Chatchat Architecture
See the Langchain-Chatchat architecture below ([source](https://github.com/chatchat-space/Langchain-Chatchat/blob/master/img/langchain%2Bchatglm.png)).
<img src="https://llm-assets.readthedocs.io/en/latest/_images/langchain-arch.png" height="50%" />
## Quickstart
### Install and Run
Follow the guide that corresponds to your specific system and device from the links provided below:
- For systems with Intel Core Ultra integrated GPU: [Windows Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_win_mtl.md#) | [Linux Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_linux_mtl.md#)
- For systems with Intel Arc A-Series GPU: [Windows Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_win_arc.md#) | [Linux Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_linux_arc.md#)
- For systems with Intel Data Center Max Series GPU: [Linux Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_linux_max.md#)
- For systems with Xeon-Series CPU: [Linux Guide](https://github.com/intel-analytics/Langchain-Chatchat/blob/ipex-llm/INSTALL_linux_xeon.md#)
### How to use RAG
#### Step 1: Create Knowledge Base
- Select `Manage Knowledge Base` from the menu on the left, then choose `New Knowledge Base` from the dropdown menu on the right side.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/new-kb.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/new-kb.png" alt="rag-menu" width="100%" align="center">
</a>
- Fill in the name of your new knowledge base (example: "test") and press the `Create` button. Adjust any other settings as needed.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/create-kb.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/create-kb.png" alt="rag-menu" width="100%" align="center">
</a>
- Upload knowledge files from your computer and allow some time for the upload to complete. Once finished, click on `Add files to Knowledge Base` button to build the vector store. Note: this process may take several minutes.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/build-kb.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/build-kb.png" alt="rag-menu" width="100%" align="center">
</a>
#### Step 2: Chat with RAG
You can now click `Dialogue` on the left-side menu to return to the chat UI. Then in `Knowledge base settings` menu, choose the Knowledge Base you just created, e.g, "test". Now you can start chatting.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/rag-menu.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/rag-menu.png" alt="rag-menu" width="100%" align="center">
</a>
<br/>
For more information about how to use Langchain-Chatchat, refer to the official Quickstart guide in [English](./README_en.md#) or [Chinese](./README_chs.md#), or the [Wiki](https://github.com/chatchat-space/Langchain-Chatchat/wiki/).
### Troubleshooting & Tips
#### 1. Version Compatibility
Ensure that you have installed `ipex-llm>=2.1.0b20240327`. To upgrade `ipex-llm`, use
```bash
pip install --pre --upgrade ipex-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```
#### 2. Prompt Templates
In the left-side menu, you have the option to choose a prompt template. There are several pre-defined templates: those ending with '_cn' are Chinese templates, and those ending with '_en' are English templates. You can also define your own prompt templates in `configs/prompt_config.py`. Remember to restart the service to enable these changes.

View file

@ -0,0 +1,169 @@
# Run Coding Copilot in VSCode with Intel GPU
[**Continue**](https://marketplace.visualstudio.com/items?itemName=Continue.continue) is a coding copilot extension in [Microsoft Visual Studio Code](https://code.visualstudio.com/); by integrating it with [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily leverage local LLMs running on Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) for code explanation, code generation/completion, etc.
Below is a demo of using `Continue` with [CodeQWen1.5-7B](https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat) running on an Intel A770 GPU. This demo illustrates how a programmer used `Continue` to find a solution for [Kaggle's _Titanic_ challenge](https://www.kaggle.com/competitions/titanic/), which involves asking `Continue` to complete the code for model fitting, evaluation, hyperparameter tuning, and feature engineering, and to explain the generated code.
<video src="https://llm-assets.readthedocs.io/en/latest/_images/continue_demo_ollama_backend_arc.mp4" width="100%" controls></video>
## Quickstart
This guide walks you through setting up and running **Continue** within _Visual Studio Code_, empowered by local large language models served via [Ollama](./ollama_quickstart.html) with `ipex-llm` optimizations.
### 1. Install and Run Ollama Serve
Visit [Run Ollama with IPEX-LLM on Intel GPU](./ollama_quickstart.html), and follow the steps 1) [Install IPEX-LLM for Ollama](./ollama_quickstart.html#install-ipex-llm-for-ollama), 2) [Initialize Ollama](./ollama_quickstart.html#initialize-ollama) 3) [Run Ollama Serve](./ollama_quickstart.html#run-ollama-serve) to install, init and start the Ollama Service.
```eval_rst
.. important::
If the `Continue` plugin is not installed on the same machine where Ollama is running (which means `Continue` needs to connect to a remote Ollama service), you must configure the Ollama service to accept connections from any IP address. To achieve this, set or export the environment variable `OLLAMA_HOST=0.0.0.0` before executing the command `ollama serve`.
.. tip::
If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionally set the following environment variable for optimal performance before executing `ollama serve`:
.. code-block:: bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
### 2. Pull and Prepare the Model
#### 2.1 Pull Model
Now we need to pull a model for coding. Here we use [CodeQWen1.5-7B](https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat) model as an example. Open a new terminal window, run the following command to pull [`codeqwen:latest`](https://ollama.com/library/codeqwen).
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export no_proxy=localhost,127.0.0.1
./ollama pull codeqwen:latest
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: cmd
set no_proxy=localhost,127.0.0.1
ollama pull codeqwen:latest
.. seealso::
Besides CodeQWen, there are other coding models you might want to explore, such as Magicoder, Wizardcoder, Codellama, Codegemma, Starcoder, Starcoder2, etc. You can find these models in the `Ollama model library <https://ollama.com/library>`_. Simply search for the model, pull it in a similar manner, and give it a try.
```
#### 2.2 Prepare the Model and Pre-load
To make `Continue` run more smoothly with Ollama, we will create a new model in Ollama based on the original one, with the `num_ctx` parameter adjusted to 4096.
Start by creating a file named `Modelfile` with the following content:
```dockerfile
FROM codeqwen:latest
PARAMETER num_ctx 4096
```
Next, use the following commands in the terminal (Linux) or Miniforge Prompt (Windows) to create a new model in Ollama named `codeqwen:latest-continue`:
```bash
ollama create codeqwen:latest-continue -f Modelfile
```
After creation, run `ollama list` to see `codeqwen:latest-continue` in the list of models.
Finally, preload the new model by executing the following command in a new terminal (Linux) or Miniforge Prompt (Windows):
```bash
ollama run codeqwen:latest-continue
```
### 3. Install `Continue` Extension
Search for `Continue` in the VSCode `Extensions Marketplace` and install it just like any other extension.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_install.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_install.png" width=100%; />
</a>
<br/>
Once installed, the `Continue` icon will appear on the left sidebar. You can drag and drop the icon to the right sidebar for easy access to the `Continue` view.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_dragdrop.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_dragdrop.png" width=100%; />
</a>
<br/>
If the icon does not appear or you cannot open the view, press `Ctrl+Shift+L` or follow the steps below to open the `Continue` view on the right side.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_openview.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_openview.png" width=100%; />
</a>
<br/>
Once you have successfully opened the `Continue` view, you will see the welcome screen as shown below. Select **Fully local** -> **Continue** -> **Continue** as illustrated.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_welcome.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_welcome.png" width=100%; />
</a>
When you see the screen below, your plug-in is ready to use.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_ready.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_ready.png" width=100%; />
</a>
### 4. `Continue` Configuration
Once `Continue` is installed and ready, simply select the model "`Ollama - codeqwen:latest-continue`" from the bottom of the `Continue` view (all models in `ollama list` will appear in the format `Ollama-xxx`).
Now you can start using `Continue`.
#### Connecting to Remote Ollama Service
You can configure `Continue` by clicking the small gear icon located at the bottom right of the `Continue` view to open `config.json`. In `config.json`, you will find all necessary configuration settings.
If you are running Ollama on the same machine as `Continue`, no changes are necessary. If Ollama is running on a different machine, you'll need to update the `apiBase` key in `Ollama` item in `config.json` to point to the remote Ollama URL, as shown in the example below and in the figure.
```json
{
"title": "Ollama",
"provider": "ollama",
"model": "AUTODETECT",
"apiBase": "http://your-ollama-service-ip:11434"
}
```
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_config.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_config.png" width=100%; />
</a>
### 5. How to Use `Continue`
For detailed tutorials please refer to [this link](https://continue.dev/docs/how-to-use-continue). Here we are only showing the most common scenarios.
#### Q&A over specific code
If you don't understand how some code works, highlight it (press `Ctrl+Shift+L`) and ask "how does this code work?"
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_quickstart_sample_usage1.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_quickstart_sample_usage1.png" width=100%; />
</a>
#### Editing code
You can ask Continue to edit your highlighted code with the command `/edit`.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/continue_quickstart_sample_usage2.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/continue_quickstart_sample_usage2.png" width=100%; />
</a>

View file

@ -0,0 +1,102 @@
# Run IPEX-LLM serving on Multiple Intel GPUs using DeepSpeed AutoTP and FastApi
This example demonstrates how to run IPEX-LLM serving on multiple [Intel GPUs](../README.md) by leveraging DeepSpeed AutoTP.
## Requirements
To run this example with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information. For this particular example, you will need at least two GPUs on your machine.
## Example
### 1. Install
```bash
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install oneccl_bind_pt==2.1.100 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
# configures OneAPI environment variables
source /opt/intel/oneapi/setvars.sh
pip install git+https://github.com/microsoft/DeepSpeed.git@ed8aed5
pip install git+https://github.com/intel/intel-extension-for-deepspeed.git@0eb734b
pip install mpi4py fastapi uvicorn
conda install -c conda-forge -y gperftools=2.10 # to enable tcmalloc
```
> **Important**: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0. Please make sure you have installed the correct version.
### 2. Run tensor parallel inference on multiple GPUs
When we run the model in a distributed manner across two GPUs, the memory consumption of each GPU is only half of what it was originally, and the GPUs can work simultaneously during inference computation.
We provide an example of running the `Llama-2-7b-chat-hf` model on two Intel Arc A770 GPUs:
```bash
# Before running this script, adjust YOUR_REPO_ID_OR_MODEL_PATH in its last line
# If you want to change the server port, set the port parameter in the last line
# To avoid GPU OOM, you can adjust the --max-num-seqs and --max-num-batched-tokens parameters in the script
bash run_llama2_7b_chat_hf_arc_2_card.sh
```
If you successfully run the serving, you can get output like this:
```bash
[0] INFO: Started server process [120071]
[0] INFO: Waiting for application startup.
[0] INFO: Application startup complete.
[0] INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
> **Note**: You can change `NUM_GPUS` to the number of GPUs you have on your machine. You can also specify other low-bit optimizations through `--low-bit`.
### 3. Sample Input and Output
We can use `curl` to test the serving API:
```bash
# Set http_proxy and https_proxy to null to ensure that requests are not forwarded by a proxy.
export http_proxy=
export https_proxy=
curl -X 'POST' \
'http://127.0.0.1:8000/generate/' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "What is AI?",
"n_predict": 32
}'
```
And you should get output like this:
```json
{
"generated_text": "What is AI? Artificial intelligence (AI) refers to the development of computer systems able to perform tasks that would normally require human intelligence, such as visual perception, speech",
"generate_time": "0.45149803161621094s"
}
```
**Important**: The first token latency is much larger than the rest token latency; you can use [our benchmark tool](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/dev/benchmark/README.md) to obtain more details about first and rest token latency.
### 4. Benchmark with wrk
We use wrk to test end-to-end throughput; see [here](https://github.com/wg/wrk).
You can install by:
```bash
sudo apt install wrk
```
Please change the test URL accordingly.
```bash
# set -t (threads) and -c (connections) to the desired concurrency to test full throughput.
wrk -t1 -c1 -d5m -s ./wrk_script_1024.lua http://127.0.0.1:8000/generate/ --timeout 1m
```

View file

@ -0,0 +1,150 @@
# Run Dify on Intel GPU
[**Dify**](https://dify.ai/) is an open-source production-ready LLM app development platform; by integrating it with [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily leverage local LLMs running on Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) for building complex AI workflows (e.g. RAG).
*See the demo of a RAG workflow in Dify running LLaMA2-7B on Intel A770 GPU below.*
<video src="https://llm-assets.readthedocs.io/en/latest/_images/dify-rag-small.mp4" width="100%" controls></video>
## Quickstart
### 1. Install and Start `Ollama` Service on Intel GPU
Follow the steps in [Run Ollama on Intel GPU Guide](./ollama_quickstart.md) to install and run Ollama on Intel GPU. Ensure that `ollama serve` is running correctly and can be accessed through a local URL (e.g., `http://127.0.0.1:11434`) or a remote URL (e.g., `http://your_ip:11434`).
We recommend pulling the desired model before proceeding with Dify. For instance, to pull the LLaMA2-7B model, you can use the following command:
```bash
ollama pull llama2:7b
```
### 2. Install and Start `Dify`
#### 2.1 Download `Dify`
You can either clone the repository or download the source zip from [github](https://github.com/langgenius/dify/archive/refs/heads/main.zip):
```bash
git clone https://github.com/langgenius/dify.git
```
#### 2.2 Setup Redis and PostgreSQL
Next, deploy PostgreSQL and Redis. You can choose to utilize Docker, following the steps in the [Local Source Code Start Guide](https://docs.dify.ai/getting-started/install-self-hosted/local-source-code#clone-dify), or proceed without Docker using the following instructions:
- Install Redis by executing `sudo apt-get install redis-server`. Refer to [this guide](https://www.hostinger.com/tutorials/how-to-install-and-setup-redis-on-ubuntu/) for Redis environment setup, including password configuration and daemon settings.
- Install PostgreSQL by following either [the Official PostgreSQL Tutorial](https://www.postgresql.org/docs/current/tutorial.html) or [a PostgreSQL Quickstart Guide](https://www.digitalocean.com/community/tutorials/how-to-install-postgresql-on-ubuntu-20-04-quickstart). After installation, proceed with the following PostgreSQL commands for setting up Dify. These commands create a username/password for Dify (e.g., `dify_user`, change `'your_password'` as desired), create a new database named `dify`, and grant privileges:
```sql
CREATE USER dify_user WITH PASSWORD 'your_password';
CREATE DATABASE dify;
GRANT ALL PRIVILEGES ON DATABASE dify TO dify_user;
```
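A minimal sketch of running the statements above with `psql`, assuming a default local PostgreSQL installation where the `postgres` superuser is accessible via sudo:
```bash
# Create the Dify database user and database, then grant privileges
sudo -u postgres psql -c "CREATE USER dify_user WITH PASSWORD 'your_password';"
sudo -u postgres psql -c "CREATE DATABASE dify;"
sudo -u postgres psql -c "GRANT ALL PRIVILEGES ON DATABASE dify TO dify_user;"
```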
Configure Redis and PostgreSQL settings in the `.env` file located under dify source folder `dify/api/`:
```bash dify/api/.env
### Example dify/api/.env
## Redis settings
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_USERNAME=your_redis_user_name # change if needed
REDIS_PASSWORD=your_redis_password # change if needed
REDIS_DB=0
## postgreSQL settings
DB_USERNAME=dify_user # change if needed
DB_PASSWORD=your_dify_password # change if needed
DB_HOST=localhost
DB_PORT=5432
DB_DATABASE=dify # change if needed
```
#### 2.3 Server Deployment
Follow the steps in the [`Server Deployment` section in Local Source Code Start Guide](https://docs.dify.ai/getting-started/install-self-hosted/local-source-code#server-deployment) to deploy and start the Dify Server.
Upon successful deployment, you will see logs in the terminal similar to the following:
```bash
INFO:werkzeug:
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5001
* Running on http://10.239.44.83:5001
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug: * Restarting with stat
WARNING:werkzeug: * Debugger is active!
INFO:werkzeug: * Debugger PIN: 227-697-894
```
#### 2.4 Deploy the frontend page
Refer to the instructions provided in the [`Deploy the frontend page` section in Local Source Code Start Guide](https://docs.dify.ai/getting-started/install-self-hosted/local-source-code#deploy-the-frontend-page) to deploy the frontend page.
Below is an example of environment variable configuration found in `dify/web/.env.local`:
```bash
# For production release, change this to PRODUCTION
NEXT_PUBLIC_DEPLOY_ENV=DEVELOPMENT
NEXT_PUBLIC_EDITION=SELF_HOSTED
NEXT_PUBLIC_API_PREFIX=http://localhost:5001/console/api
NEXT_PUBLIC_PUBLIC_API_PREFIX=http://localhost:5001/api
NEXT_PUBLIC_SENTRY_DSN=
```
```eval_rst
.. note::
If you encounter connection problems, you may run `export no_proxy=localhost,127.0.0.1` before starting the API service, Worker service and frontend.
```
### 3. How to Use `Dify`
For comprehensive usage instructions of Dify, please refer to the [Dify Documentation](https://docs.dify.ai/). In this section, we'll only highlight a few key steps for local LLM setup.
#### Setup Ollama
Open your browser and access the Dify UI at `http://localhost:3000`.
Configure the Ollama URL in `Settings > Model Providers > Ollama`. For detailed instructions on how to do this, see the [Ollama Guide in the Dify Documentation](https://docs.dify.ai/tutorials/model-configuration/ollama).
<p align="center"><a href="https://docs.dify.ai/~gitbook/image?url=https%3A%2F%2F3866086014-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FRncMhlfeYTrpujwzDIqw%252Fuploads%252Fgit-blob-351b275c8b6420ff85c77e67bf39a11aaf899b7b%252Follama-config-en.png%3Falt%3Dmedia&width=768&dpr=2&quality=100&sign=1ec95e72d9d0459384cce28665eb84ffd8ed59c906ab0fdb3f47fa67f61275dc" target="_blank" align="center"><img src="https://docs.dify.ai/~gitbook/image?url=https%3A%2F%2F3866086014-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FRncMhlfeYTrpujwzDIqw%252Fuploads%252Fgit-blob-351b275c8b6420ff85c77e67bf39a11aaf899b7b%252Follama-config-en.png%3Falt%3Dmedia&width=768&dpr=2&quality=100&sign=1ec95e72d9d0459384cce28665eb84ffd8ed59c906ab0fdb3f47fa67f61275dc" alt="rag-menu" width="80%" align="center"></a></p>
Once Ollama is successfully connected, you will see a list of Ollama models similar to the following:
<p align="center"><a href="https://llm-assets.readthedocs.io/en/latest/_images/dify-p1.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/dify-p1.png" alt="image-p1" width=100%; />
</a></p>
#### Run a simple RAG
- Select the text summarization workflow template from the studio.
<p><a href="https://llm-assets.readthedocs.io/en/latest/_images/dify-p2.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/dify-p2.png" alt="image-p2" width=100%; align="center" />
</a></p>
- Add a knowledge base and specify the LLM or embedding model to use.
<p><a href="https://llm-assets.readthedocs.io/en/latest/_images/dify-p3.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/dify-p3.png" alt="image-p3" width=100%; />
</a></p>
- Enter your input in the workflow and execute it. You'll find retrieval results and generated answers on the right.
<p align="center"><a href="https://llm-assets.readthedocs.io/en/latest/_images/dify-p5.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/dify-p5.png" alt="image-20240221102252560" width=100%; align="center"/>
</a></p>

View file

@ -0,0 +1,421 @@
# Serving using IPEX-LLM and FastChat
FastChat is an open platform for training, serving, and evaluating large language model based chatbots. You can find the detailed information at their [homepage](https://github.com/lm-sys/FastChat).
IPEX-LLM can be easily integrated into FastChat so that users can use `IPEX-LLM` as a serving backend in their deployments.
## Quick Start
This quickstart guide walks you through installing and running `FastChat` with `ipex-llm`.
## 1. Install IPEX-LLM with FastChat
To run on CPU, you can install ipex-llm as follows:
```bash
pip install --pre --upgrade ipex-llm[serving,all]
```
To add GPU support for FastChat, you may install **`ipex-llm`** as follows:
```bash
pip install --pre --upgrade ipex-llm[xpu,serving] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
## 2. Start the service
### Launch controller
You first need to run the FastChat controller:
```bash
python3 -m fastchat.serve.controller
```
If the controller runs successfully, you will see output like this:
```bash
Uvicorn running on http://localhost:21001
```
### Launch model worker(s) and load models
Using IPEX-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformer models can be utilized in FastChat.
#### IPEX-LLM worker
To integrate IPEX-LLM with `FastChat` efficiently, we have provided a new model_worker implementation named `ipex_llm_worker.py`.
```bash
# On CPU
# Available low_bit format including sym_int4, sym_int8, bf16 etc.
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "cpu"
# On GPU
# Available low_bit format including sym_int4, sym_int8, fp16 etc.
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu"
```
We have also provided an option `--load-low-bit-model` to load models that have been converted and saved to disk using the `save_low_bit` interface, as introduced in this [document](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/hugging_face_format.html#save-load).
Check the following examples:
```bash
# Or --device "cpu"
python -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path /Low/Bit/Model/Path --trust-remote-code --device "xpu" --load-low-bit-model
```
#### Self-speculative decoding example
You can use IPEX-LLM to run a `self-speculative decoding` example. Refer to [here](https://github.com/intel-analytics/ipex-llm/tree/c9fac8c26bf1e1e8f7376fa9a62b32951dd9e85d/python/llm/example/GPU/Speculative-Decoding) for more details on Intel MAX GPUs, and refer to [here](https://github.com/intel-analytics/ipex-llm/tree/c9fac8c26bf1e1e8f7376fa9a62b32951dd9e85d/python/llm/example/GPU/Speculative-Decoding) for more details on Intel CPUs.
```bash
# Available low_bit format only including bf16 on CPU.
source ipex-llm-init -t
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "bf16" --trust-remote-code --device "cpu" --speculative
# Available low_bit format only including fp16 on GPU.
source /opt/intel/oneapi/setvars.sh
export ENABLE_SDP_FUSION=1
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "fp16" --trust-remote-code --device "xpu" --speculative
```
You can get output like this:
```bash
2024-04-12 18:18:09 | INFO | ipex_llm.transformers.utils | Converting the current model to sym_int4 format......
2024-04-12 18:18:11 | INFO | model_worker | Register to controller
2024-04-12 18:18:11 | ERROR | stderr | INFO: Started server process [126133]
2024-04-12 18:18:11 | ERROR | stderr | INFO: Waiting for application startup.
2024-04-12 18:18:11 | ERROR | stderr | INFO: Application startup complete.
2024-04-12 18:18:11 | ERROR | stderr | INFO: Uvicorn running on http://localhost:21002
```
For a full list of accepted arguments, refer to the main method of `ipex_llm_worker.py`.
#### IPEX-LLM vLLM worker
We also provide the `vllm_worker` which uses the [vLLM](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/vLLM-Serving) engine for better hardware utilization.
To run using the `vllm_worker`, you don't need to change the model name; simply use the following commands:
```bash
# On CPU
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device cpu
# On GPU
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu --load-in-low-bit "sym_int4" --enforce-eager
```
#### Launch multiple workers
Sometimes we may want to start multiple workers for the best performance. When running on CPU, you may want to separate the workers across different sockets. Assuming each socket has 48 physical cores, you can start two workers as in the following example:
```bash
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "cpu" &
# All the workers other than the first worker need to specify a different worker port and corresponding worker-address
numactl -C 48-95 -m 1 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "cpu" --port 21003 --worker-address "http://localhost:21003" &
```
For GPU, we may want to start two workers using different GPUs. To achieve this, use the `ZE_AFFINITY_MASK` environment variable to select different GPUs for different workers. An example is shown below:
```bash
ZE_AFFINITY_MASK=1 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu" &
# All the workers other than the first worker need to specify a different worker port and corresponding worker-address
ZE_AFFINITY_MASK=2 python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --low-bit "sym_int4" --trust-remote-code --device "xpu" --port 21003 --worker-address "http://localhost:21003" &
```
If you are not sure about the effect of `ZE_AFFINITY_MASK`, you can set it and check the result of `sycl-ls`.
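For example, a minimal sketch of comparing the device list with and without a mask (the mask value `1` here is just an illustration; pick the GPU index you intend to use):
```bash
# Full device list
sycl-ls
# Devices visible to a process started with the mask
ZE_AFFINITY_MASK=1 sycl-ls
```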
### Launch Gradio web server
When you have started the controller and the worker, you can start web server as follows:
```bash
python3 -m fastchat.serve.gradio_web_server
```
This is the user interface that users will interact with.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/fastchat_gradio_web_ui.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/fastchat_gradio_web_ui.png" width=100%; />
</a>
By following these steps, you will be able to serve your models using the web UI with IPEX-LLM as the backend. You can open your browser and chat with a model now.
### Launch TGI Style API server
When you have started the controller and the worker, you can start TGI Style API server as follows:
```bash
python3 -m ipex_llm.serving.fastchat.tgi_api_server --host localhost --port 8000
```
You can use `curl` to observe the output of the API.
#### Using /generate API
This sends a sentence as input in the request, and the response is expected to contain the model-generated answer.
```bash
curl -X POST -H "Content-Type: application/json" -d '{
"inputs": "What is AI?",
"parameters": {
"best_of": 1,
"decoder_input_details": true,
"details": true,
"do_sample": true,
"frequency_penalty": 0.1,
"grammar": {
"type": "json",
"value": "string"
},
"max_new_tokens": 32,
"repetition_penalty": 1.03,
"return_full_text": false,
"seed": 0.1,
"stop": [
"photographer"
],
"temperature": 0.5,
"top_k": 10,
"top_n_tokens": 5,
"top_p": 0.95,
"truncate": true,
"typical_p": 0.95,
"watermark": true
}
}' http://localhost:8000/generate
```
Sample output:
```bash
{
"details": {
"best_of_sequences": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer "
},
"finish_reason": "length",
"generated_text": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer ",
"generated_tokens": 31
}
]
},
"generated_text": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer ",
"usage": {
"prompt_tokens": 4,
"total_tokens": 35,
"completion_tokens": 31
}
}
```
#### Using /generate_stream API
This sends a sentence as input in the request, and a long-lived connection is opened to continuously receive multiple responses containing the model-generated answer.
```bash
curl -X POST -H "Content-Type: application/json" -d '{
"inputs": "What is AI?",
"parameters": {
"best_of": 1,
"decoder_input_details": true,
"details": true,
"do_sample": true,
"frequency_penalty": 0.1,
"grammar": {
"type": "json",
"value": "string"
},
"max_new_tokens": 32,
"repetition_penalty": 1.03,
"return_full_text": false,
"seed": 0.1,
"stop": [
"photographer"
],
"temperature": 0.5,
"top_k": 10,
"top_n_tokens": 5,
"top_p": 0.95,
"truncate": true,
"typical_p": 0.95,
"watermark": true
}
}' http://localhost:8000/generate_stream
```
Sample output:
```bash
data: {"token": {"id": 663359, "text": "", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 300560, "text": "\n", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 725120, "text": "Artificial Intelligence ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 734609, "text": "(AI) is ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 362235, "text": "a branch of computer ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 380983, "text": "science that attempts to ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 249979, "text": "simulate the way that ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 972663, "text": "the human brain ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 793301, "text": "works. It is a ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 501380, "text": "branch of computer ", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 673232, "text": "", "logprob": 0.0, "special": false}, "generated_text": null, "details": null, "special_ret": null}
data: {"token": {"id": 2, "text": "</s>", "logprob": 0.0, "special": true}, "generated_text": "\nArtificial Intelligence (AI) is a branch of computer science that attempts to simulate the way that the human brain works. It is a branch of computer ", "details": {"finish_reason": "eos_token", "generated_tokens": 31, "prefill_tokens": 4, "seed": 2023}, "special_ret": {"tensor": []}}
```
### Launch RESTful API server
To start an OpenAI API server that provides compatible APIs using IPEX-LLM backend, you can launch the `openai_api_server` and follow this [doc](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) to use it.
When you have started the controller and the worker, you can start RESTful API server as follows:
```bash
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
You can use `curl` to observe the output of the API, and format the output using `jq`.
#### List Models
```bash
curl http://localhost:8000/v1/models | jq
```
Example output
```json
{
"object": "list",
"data": [
{
"id": "Llama-2-7b-chat-hf",
"object": "model",
"created": 1712919071,
"owned_by": "fastchat",
"root": "Llama-2-7b-chat-hf",
"parent": null,
"permission": [
{
"id": "modelperm-XpFyEE7Sewx4XYbEcdbCVz",
"object": "model_permission",
"created": 1712919071,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": true,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
}
]
}
```
#### Chat Completions
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-2-7b-chat-hf",
"messages": [{"role": "user", "content": "Hello! What is your name?"}]
}' | jq
```
Example output
```json
{
"id": "chatcmpl-jJ9vKSGkcDMTxKfLxK7q2x",
"object": "chat.completion",
"created": 1712919092,
"model": "Llama-2-7b-chat-hf",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! My name is LLaMA, I'm a large language model trained by a team of researcher at Meta AI. Unterscheidung. 😊"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 15,
"total_tokens": 53,
"completion_tokens": 38
}
}
```
#### Text Completions
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-2-7b-chat-hf",
"prompt": "Once upon a time",
"max_tokens": 41,
"temperature": 0.5
}' | jq
```
Example Output:
```json
{
"id": "cmpl-PsAkpTWMmBLzWCTtM4r97Y",
"object": "text_completion",
"created": 1712919307,
"model": "Llama-2-7b-chat-hf",
"choices": [
{
"index": 0,
"text": ", in a far-off land, there was a magical kingdom called \"Happily Ever Laughter.\" It was a place where laughter was the key to happiness, and everyone who ",
"logprobs": null,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 5,
"total_tokens": 45,
"completion_tokens": 40
}
}
```
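Since the server exposes OpenAI-compatible endpoints, you can also request streaming responses; a hedged sketch (whether streaming is enabled and the exact chunk format depend on your FastChat version):

```bash
# Responses arrive as server-sent events: lines of the form "data: {...}",
# typically terminated by "data: [DONE]".
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}],
    "stream": true
  }'
```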

View file

@@ -0,0 +1,33 @@
IPEX-LLM Quickstart
================================
.. note::
We are adding more Quickstart guide.
This section includes efficient guides to show you how to:
* |bigdl_llm_migration_guide|_
* `Install IPEX-LLM on Linux with Intel GPU <./install_linux_gpu.html>`_
* `Install IPEX-LLM on Windows with Intel GPU <./install_windows_gpu.html>`_
* `Install IPEX-LLM in Docker on Windows with Intel GPU <./docker_windows_gpu.html>`_
* `Run PyTorch Inference on Intel GPU using Docker (on Linux or WSL) <./docker_benchmark_quickstart.html>`_
* `Run Performance Benchmarking with IPEX-LLM <./benchmark_quickstart.html>`_
* `Run Local RAG using Langchain-Chatchat on Intel GPU <./chatchat_quickstart.html>`_
* `Run Text Generation WebUI on Intel GPU <./webui_quickstart.html>`_
* `Run Open WebUI on Intel GPU <./open_webui_with_ollama_quickstart.html>`_
* `Run PrivateGPT with IPEX-LLM on Intel GPU <./privateGPT_quickstart.html>`_
* `Run Coding Copilot (Continue) in VSCode with Intel GPU <./continue_quickstart.html>`_
* `Run Dify on Intel GPU <./dify_quickstart.html>`_
* `Run llama.cpp with IPEX-LLM on Intel GPU <./llama_cpp_quickstart.html>`_
* `Run Ollama with IPEX-LLM on Intel GPU <./ollama_quickstart.html>`_
* `Run Llama 3 on Intel GPU using llama.cpp and ollama with IPEX-LLM <./llama3_llamacpp_ollama_quickstart.html>`_
* `Run IPEX-LLM Serving with FastChat <./fastchat_quickstart.html>`_
* `Run IPEX-LLM Serving with vLLM on Intel GPU <./vLLM_quickstart.html>`_
* `Finetune LLM with Axolotl on Intel GPU <./axolotl_quickstart.html>`_
* `Run IPEX-LLM serving on Multiple Intel GPUs using DeepSpeed AutoTP and FastApi <./deepspeed_autotp_fastapi_quickstart.html>`_
.. |bigdl_llm_migration_guide| replace:: ``bigdl-llm`` Migration Guide
.. _bigdl_llm_migration_guide: bigdl_llm_migration.html

View file

@@ -0,0 +1,313 @@
# Install IPEX-LLM on Linux with Intel GPU
This guide demonstrates how to install IPEX-LLM on Linux with Intel GPUs. It applies to Intel Data Center GPU Flex Series and Max Series, as well as Intel Arc Series GPUs.
IPEX-LLM currently supports the Ubuntu 20.04 operating system and later, and supports PyTorch 2.0 and PyTorch 2.1 on Linux. This page demonstrates IPEX-LLM with PyTorch 2.1. Check the [Installation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#linux) page for more details.
## Install Prerequisites
### Install GPU Driver
#### For Linux kernel 6.2
* Install wget, gpg-agent
```bash
sudo apt-get install -y gpg-agent wget
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \
sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
```
<img src="https://llm-assets.readthedocs.io/en/latest/_images/wget.png" width=100%; />
* Install drivers
```bash
sudo apt-get update
sudo apt-get -y install \
gawk \
dkms \
linux-headers-$(uname -r) \
libc6-dev
sudo apt install intel-i915-dkms intel-fw-gpu
sudo apt-get install -y gawk libc6-dev udev \
intel-opencl-icd intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo
sudo reboot
```
<img src="https://llm-assets.readthedocs.io/en/latest/_images/i915.png" width=100%; />
<img src="https://llm-assets.readthedocs.io/en/latest/_images/gawk.png" width=100%; />
* Configure permissions
```bash
sudo gpasswd -a ${USER} render
newgrp render
# Verify the device is working with i915 driver
sudo apt-get install -y hwinfo
hwinfo --display
```
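You can also confirm that your user has been added to the `render` group; a minimal sketch:

```bash
# The "render" group should appear in the output for your user
groups ${USER}
```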
#### For Linux kernel 6.5
* Install wget, gpg-agent
```bash
sudo apt-get install -y gpg-agent wget
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \
sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
```
<img src="https://llm-assets.readthedocs.io/en/latest/_images/wget.png" width=100%; />
* Install drivers
```bash
sudo apt-get update
sudo apt-get -y install \
gawk \
dkms \
linux-headers-$(uname -r) \
libc6-dev
sudo apt-get install -y gawk libc6-dev udev \
intel-opencl-icd intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo
sudo apt install -y intel-i915-dkms intel-fw-gpu
sudo reboot
```
<img src="https://llm-assets.readthedocs.io/en/latest/_images/gawk.png" width=100%; />
#### (Optional) Update Level Zero on Intel Core™ Ultra iGPU
For Intel Core™ Ultra integrated GPUs, please make sure the level_zero version is >= 1.3.28717. The level_zero version can be checked with `sycl-ls`; the version is shown after `[ext_oneapi_level_zero:gpu]`.
Here is a sample output of `sycl-ls`:
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 5 125H OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [24.09.28717.12]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.28717]
```
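If you only want to check the Level Zero entry, you can filter the `sycl-ls` output; a minimal sketch:

```bash
# Prints just the Level Zero GPU line, e.g. "... Intel(R) Arc(TM) Graphics 1.3 [1.3.28717]"
sycl-ls | grep "ext_oneapi_level_zero"
```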
If your level_zero version is lower than 1.3.28717, you can update it as follows:
```bash
wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.16238.4/intel-igc-core_1.0.16238.4_amd64.deb
wget https://github.com/intel/intel-graphics-compiler/releases/download/igc-1.0.16238.4/intel-igc-opencl_1.0.16238.4_amd64.deb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-level-zero-gpu-dbgsym_1.3.28717.12_amd64.ddeb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-level-zero-gpu_1.3.28717.12_amd64.deb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-opencl-icd-dbgsym_24.09.28717.12_amd64.ddeb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/intel-opencl-icd_24.09.28717.12_amd64.deb
wget https://github.com/intel/compute-runtime/releases/download/24.09.28717.12/libigdgmm12_22.3.17_amd64.deb
sudo dpkg -i *.deb
```
### Install oneAPI
```
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-oneapi-common-vars=2024.0.0-49406 \
intel-oneapi-common-oneapi-vars=2024.0.0-49406 \
intel-oneapi-diagnostics-utility=2024.0.0-49093 \
intel-oneapi-compiler-dpcpp-cpp=2024.0.2-49895 \
intel-oneapi-dpcpp-ct=2024.0.0-49381 \
intel-oneapi-mkl=2024.0.0-49656 \
intel-oneapi-mkl-devel=2024.0.0-49656 \
intel-oneapi-mpi=2021.11.0-49493 \
intel-oneapi-mpi-devel=2021.11.0-49493 \
intel-oneapi-dal=2024.0.1-25 \
intel-oneapi-dal-devel=2024.0.1-25 \
intel-oneapi-ippcp=2021.9.1-5 \
intel-oneapi-ippcp-devel=2021.9.1-5 \
intel-oneapi-ipp=2021.10.1-13 \
intel-oneapi-ipp-devel=2021.10.1-13 \
intel-oneapi-tlt=2024.0.0-352 \
intel-oneapi-ccl=2021.11.2-5 \
intel-oneapi-ccl-devel=2021.11.2-5 \
intel-oneapi-dnnl-devel=2024.0.0-49521 \
intel-oneapi-dnnl=2024.0.0-49521 \
intel-oneapi-tcm-1.0=1.0.0-435
```
<img src="https://llm-assets.readthedocs.io/en/latest/_images/oneapi.png" alt="image-20240221102252565" width=100%; />
<img src="https://llm-assets.readthedocs.io/en/latest/_images/basekit.png" alt="image-20240221102252565" width=100%; />
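After the installation completes, you can optionally confirm that the oneAPI environment is usable; a minimal sketch (assuming the default `/opt/intel/oneapi` install location):

```bash
# Load the oneAPI environment, then check that the DPC++ compiler is on PATH
# and that the SYCL runtime can see your Intel GPU
source /opt/intel/oneapi/setvars.sh
icpx --version
sycl-ls
```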
### Setup Python Environment
Download and install the Miniforge as follows if you don't have conda installed on your machine:
```bash
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
source ~/.bashrc
```
You can use `conda --version` to verify your conda installation.
After installation, create a new python environment `llm`:
```bash
conda create -n llm python=3.11
```
Activate the newly created environment `llm`:
```bash
conda activate llm
```
## Install `ipex-llm`
With the `llm` environment active, use `pip` to install `ipex-llm` for GPU.
Choose either US or CN website for `extra-index-url`:
```eval_rst
.. tabs::
.. tab:: US
.. code-block:: cmd
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. tab:: CN
.. code-block:: cmd
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
```
```eval_rst
.. note::
If you encounter network issues while installing IPEX, refer to `this guide <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id3>`_ for troubleshooting advice.
```
## Verify Installation
* You can verify if `ipex-llm` is successfully installed by simply importing a few classes from the library. For example, execute the following import command in the terminal:
```bash
source /opt/intel/oneapi/setvars.sh
python
> from ipex_llm.transformers import AutoModel, AutoModelForCausalLM
```
## Runtime Configurations
To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example.
```eval_rst
.. tabs::
.. tab:: Intel Arc™ A-Series and Intel Data Center GPU Flex
For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend:
.. code-block:: bash
# Configure oneAPI environment variables. Required step for APT or offline installed oneAPI.
# Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH.
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables for optimal performance
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
.. tab:: Intel Data Center GPU Max
For Intel Data Center GPU Max Series, we recommend:
.. code-block:: bash
# Configure oneAPI environment variables. Required step for APT or offline installed oneAPI.
# Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH.
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables for optimal performance
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
Please note that ``libtcmalloc.so`` can be installed by ``conda install -c conda-forge -y gperftools=2.10``
```
```eval_rst
.. seealso::
Please refer to `this guide <../Overview/install_gpu.html#id5>`_ for more details regarding runtime configuration.
```
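If you run GPU examples often, it can be convenient to collect the recommended variables for your device into a small script that you source before each run; a sketch for the Arc/Flex configuration above (the script name `ipex-llm-env.sh` is just an example):

```bash
# Write the recommended settings for Intel Arc / Flex into a reusable script
cat > ipex-llm-env.sh << 'EOF'
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
EOF

# Source it in every new shell before running a GPU example
source ipex-llm-env.sh
```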
## A Quick Example
Now let's play with a real LLM. We'll be using the [phi-1.5](https://huggingface.co/microsoft/phi-1_5) model, a 1.3 billion parameter LLM, for this demonstration. Follow the steps below to set up and run the model, and observe how it responds to the prompt "What is AI?".
* Step 1: Activate the Python environment `llm` you previously created:
```bash
conda activate llm
```
* Step 2: Follow [Runtime Configurations Section](#runtime-configurations) above to prepare your runtime environment.
* Step 3: Create a new file named `demo.py` and insert the code snippet below.
```python
# Copy/Paste the contents to a new file demo.py
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig
generation_config = GenerationConfig(use_cache = True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
# Load the model with ipex-llm 4-bit optimization and move it to the GPU
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1_5", load_in_4bit=True, cpu_embedding=True, trust_remote_code=True)
model = model.to('xpu')
# Format the prompt
question = "What is AI?"
prompt = " Question:{prompt}\n\n Answer:".format(prompt=question)
# Generate predicted tokens
with torch.inference_mode():
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
# warm up one more time before the actual generation task for the first run, see details in `Tips & Troubleshooting`
# output = model.generate(input_ids, do_sample=False, max_new_tokens=32, generation_config = generation_config)
output = model.generate(input_ids, do_sample=False, max_new_tokens=32, generation_config = generation_config).cpu()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_str)
```
> Note: when running LLMs on Intel iGPUs with limited memory size, we recommend setting `cpu_embedding=True` in the `from_pretrained` function.
> This will allow the memory-intensive embedding layer to utilize the CPU instead of GPU.
* Step 4. Run `demo.py` within the activated Python environment using the following command:
```bash
python demo.py
```
### Example output
Example output on a system equipped with an 11th Gen Intel Core i7 CPU and Iris Xe Graphics iGPU:
```
Question:What is AI?
Answer: AI stands for Artificial Intelligence, which is the simulation of human intelligence in machines.
```
## Tips & Troubleshooting
### Warm-up for optimal performance on first run
When running LLMs on GPU for the first time, you might notice the performance is lower than expected, with delays of up to several minutes before the first token is generated. This delay occurs because the GPU kernels require compilation and initialization, which varies across different GPU types. To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. If you're developing an application, you can incorporate this warm-up step into your start-up or loading routine to enhance the user experience.

View file

@@ -0,0 +1,305 @@
# Install IPEX-LLM on Windows with Intel GPU
This guide demonstrates how to install IPEX-LLM on Windows with Intel GPUs.
It applies to Intel Core Ultra and 11th to 14th Gen Intel Core integrated GPUs (iGPUs), as well as Intel Arc Series GPUs.
## Install Prerequisites
### (Optional) Update GPU Driver
```eval_rst
.. tip::
It is recommended to update your GPU driver, if you have driver version lower than ``31.0.101.5122``. Refer to `here <../Overview/install_gpu.html#prerequisites>`_ for more information.
```
Download and install the latest GPU driver from the [official Intel download page](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html). A system reboot is necessary to apply the changes after the installation is complete.
```eval_rst
.. note::
The process could take around 10 minutes. After reboot, check for the **Intel Arc Control** application to verify the driver has been installed correctly. If the installation was successful, you should see the **Arc Control** interface similar to the figure below
```
<img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_3.png" width=100%; />
<!-- ### Install oneAPI -->
<!-- Download and install the [**Intel oneAPI Base Toolkit 2024.0**](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=window&distributions=offline). During installation, you can continue with the default installation settings.
<img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_oneapi_offline_installer.png" width=100%; />
```eval_rst
.. tip::
If the oneAPI installation hangs at the finalization step for more than 10 minutes, the error might be due to a problematic install of Visual Studio. Please reboot your computer and then launch the Visual Studio installer. If you see installation error messages, please repair your Visual Studio installation. After the repair is done, oneAPI installation is completed successfully.
``` -->
### Setup Python Environment
Visit [Miniforge installation page](https://conda-forge.org/download/), download the **Miniforge installer for Windows**, and follow the instructions to complete the installation.
<div align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_miniforge_download.png" width=80%/>
</div>
After installation, open the **Miniforge Prompt**, create a new python environment `llm`:
```cmd
conda create -n llm python=3.11 libuv
```
Activate the newly created environment `llm`:
```cmd
conda activate llm
```
## Install `ipex-llm`
With the `llm` environment active, use `pip` to install `ipex-llm` for GPU. Choose either US or CN website for `extra-index-url`:
```eval_rst
.. tabs::
.. tab:: US
.. code-block:: cmd
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
.. tab:: CN
.. code-block:: cmd
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
```
```eval_rst
.. note::
If you encounter network issues while installing IPEX, refer to `this guide <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#install-ipex-llm-from-wheel>`_ for troubleshooting advice.
```
## Verify Installation
You can verify if `ipex-llm` is successfully installed by following the steps below.
### Step 1: Runtime Configurations
* Open the **Miniforge Prompt** and activate the Python environment `llm` you previously created:
```cmd
conda activate llm
```
<!-- * Configure oneAPI variables by running the following command:
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
``` -->
* Set the following environment variables according to your device:
```eval_rst
.. tabs::
.. tab:: Intel iGPU
.. code-block:: cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
.. tab:: Intel Arc™ A770
.. code-block:: cmd
set SYCL_CACHE_PERSISTENT=1
```
```eval_rst
.. seealso::
For other Intel dGPU Series, please refer to `this guide <../Overview/install_gpu.html#runtime-configuration>`_ for more details regarding runtime configuration.
```
### Step 2: Run Python Code
* Launch the Python interactive shell by typing `python` in the Miniforge Prompt window and then press Enter.
* Copy the following code into the Miniforge Prompt **line by line** and press Enter **after copying each line**.
```python
import torch
from ipex_llm.transformers import AutoModel,AutoModelForCausalLM
tensor_1 = torch.randn(1, 1, 40, 128).to('xpu')
tensor_2 = torch.randn(1, 1, 128, 40).to('xpu')
print(torch.matmul(tensor_1, tensor_2).size())
```
It will output the following content at the end:
```
torch.Size([1, 1, 40, 40])
```
```eval_rst
.. seealso::
If you encounter any problem, please refer to `here <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#troubleshooting>`_ for help.
```
* To exit the Python interactive shell, simply press Ctrl+Z then press Enter (or input `exit()` then press Enter).
## Monitor GPU Status
To monitor your GPU's performance and status (e.g. memory consumption, utilization, etc.), you can use either the **Windows Task Manager (in `Performance` Tab)** (see the left side of the figure below) or the **Arc Control** application (see the right side of the figure below)
<img src="https://llm-assets.readthedocs.io/en/latest/_images/quickstart_windows_gpu_4.png" width=100%; />
## A Quick Example
Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) model, a 1.8 billion parameter LLM, for this demonstration. Follow the steps below to set up and run the model, and observe how it responds to the prompt "What is AI?".
* Step 1: Follow [Runtime Configurations Section](#step-1-runtime-configurations) above to prepare your runtime environment.
* Step 2: Install the additional packages required by Qwen-1.8B-Chat:
```cmd
pip install tiktoken transformers_stream_generator einops
```
* Step 3: Create the code file. IPEX-LLM supports loading models from either Hugging Face or ModelScope. Please choose according to your requirements.
```eval_rst
.. tabs::
.. tab:: Hugging Face
Create a new file named ``demo.py`` and insert the code snippet below to run `Qwen-1.8B-Chat <https://huggingface.co/Qwen/Qwen-1_8B-Chat>`_ model with IPEX-LLM optimizations.
.. code-block:: python
# Copy/Paste the contents to a new file demo.py
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig
generation_config = GenerationConfig(use_cache=True)
print('Now start loading Tokenizer and optimizing Model...')
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
trust_remote_code=True)
# Load Model using ipex-llm and load it to GPU
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
load_in_4bit=True,
cpu_embedding=True,
trust_remote_code=True)
model = model.to('xpu')
print('Successfully loaded Tokenizer and optimized Model!')
# Format the prompt
question = "What is AI?"
prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
# Generate predicted tokens
with torch.inference_mode():
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
print('--------------------------------------Note-----------------------------------------')
print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |')
print('| Please be patient until it finishes warm-up... |')
print('-----------------------------------------------------------------------------------')
# To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks.
# If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience.
output = model.generate(input_ids,
do_sample=False,
max_new_tokens=32,
generation_config=generation_config) # warm-up
print('Successfully finished warm-up, now start generation...')
output = model.generate(input_ids,
do_sample=False,
max_new_tokens=32,
generation_config=generation_config).cpu()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_str)
.. tab:: ModelScope
Please first run following command in Miniforge Prompt to install ModelScope:
.. code-block:: cmd
pip install modelscope==1.11.0
Create a new file named ``demo.py`` and insert the code snippet below to run `Qwen-1.8B-Chat <https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary>`_ model with IPEX-LLM optimizations.
.. code-block:: python
# Copy/Paste the contents to a new file demo.py
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import GenerationConfig
from modelscope import AutoTokenizer
generation_config = GenerationConfig(use_cache=True)
print('Now start loading Tokenizer and optimizing Model...')
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
trust_remote_code=True)
# Load Model using ipex-llm and load it to GPU
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
load_in_4bit=True,
cpu_embedding=True,
trust_remote_code=True,
model_hub='modelscope')
model = model.to('xpu')
print('Successfully loaded Tokenizer and optimized Model!')
# Format the prompt
question = "What is AI?"
prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
# Generate predicted tokens
with torch.inference_mode():
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
print('--------------------------------------Note-----------------------------------------')
print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |')
print('| Please be patient until it finishes warm-up... |')
print('-----------------------------------------------------------------------------------')
# To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks.
# If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience.
output = model.generate(input_ids,
do_sample=False,
max_new_tokens=32,
generation_config=generation_config) # warm-up
print('Successfully finished warm-up, now start generation...')
output = model.generate(input_ids,
do_sample=False,
max_new_tokens=32,
generation_config=generation_config).cpu()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_str)
.. tip::
Please note that the repo id on ModelScope may be different from Hugging Face for some models.
```
```eval_rst
.. note::
When running LLMs on Intel iGPUs with limited memory size, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function.
This will allow the memory-intensive embedding layer to utilize the CPU instead of GPU.
```
* Step 4. Run `demo.py` within the activated Python environment using the following command:
```cmd
python demo.py
```
### Example output
Example output on a system equipped with an Intel Core Ultra 5 125H CPU and Intel Arc Graphics iGPU:
```
user: What is AI?
assistant: AI stands for Artificial Intelligence, which refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition,
```
## Tips & Troubleshooting
### Warm-up for optimal performance on first run
When running LLMs on GPU for the first time, you might notice the performance is lower than expected, with delays of up to several minutes before the first token is generated. This delay occurs because the GPU kernels require compilation and initialization, which varies across different GPU types. To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. If you're developing an application, you can incorporate this warm-up step into your start-up or loading routine to enhance the user experience.

View file

@@ -0,0 +1,201 @@
# Run Llama 3 on Intel GPU using llama.cpp and ollama with IPEX-LLM
[Llama 3](https://llama.meta.com/llama3/) is the latest family of Large Language Models released by [Meta](https://llama.meta.com/), providing state-of-the-art performance and excelling at language nuances, contextual understanding, and complex tasks like translation and dialogue generation.
Now, you can easily run Llama 3 on Intel GPU using `llama.cpp` and `Ollama` with IPEX-LLM.
See the demo of running Llama-3-8B-Instruct on Intel Arc GPU using `Ollama` below.
<video src="https://llm-assets.readthedocs.io/en/latest/_images/ollama-llama3-linux-arc.mp4" width="100%" controls></video>
## Quick Start
This quickstart guide walks you through how to run Llama 3 on Intel GPU using `llama.cpp` / `Ollama` with IPEX-LLM.
### 1. Run Llama 3 using llama.cpp
#### 1.1 Install IPEX-LLM for llama.cpp and Initialize
Visit [Run llama.cpp with IPEX-LLM on Intel GPU Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html), and follow the instructions in section [Prerequisites](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#prerequisites) to setup and section [Install IPEX-LLM for llama.cpp](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#install-ipex-llm-for-llama-cpp) to install the IPEX-LLM with llama.cpp binaries, then follow the instructions in section [Initialize llama.cpp with IPEX-LLM](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#initialize-llama-cpp-with-ipex-llm) to initialize.
**After the above steps, you should have created a conda environment (named `llm-cpp`, for instance) and have the llama.cpp binaries in your current directory.**
**Now you can use these executable files following standard llama.cpp usage.**
#### 1.2 Download Llama3
There are already some GGUF models of Llama 3 in the community; here we take [Meta-Llama-3-8B-Instruct-GGUF](https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF) as an example.
Suppose you have downloaded a [Meta-Llama-3-8B-Instruct-Q4_K_M.gguf](https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf) model from [Meta-Llama-3-8B-Instruct-GGUF](https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF) and put it under `<model_dir>`.
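If you have not downloaded it yet, a minimal sketch using `wget` and the download link above (`<model_dir>` is a placeholder for your model directory):

```bash
# Replace <model_dir> with your model directory before running
mkdir -p <model_dir>
wget -P <model_dir> https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
```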
#### 1.3 Run Llama3 on Intel GPU using llama.cpp
#### Runtime Configuration
To use GPU acceleration, several environment variables are required or recommended before running `llama.cpp`.
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
.. tab:: Windows
.. code-block:: bash
set SYCL_CACHE_PERSISTENT=1
```
```eval_rst
.. tip::
If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance:
.. code-block:: bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
##### Run llama3
Under your current directory, execute the command below to run inference with Llama 3:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
./main -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -t 8 -e -ngl 33 --color --no-mmap
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
main -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -e -ngl 33 --color --no-mmap
```
Under your current directory, you can also execute the command below to have an interactive chat with Llama 3:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
./main -ngl 33 --interactive-first --color -e --in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -r '<|eot_id|>' -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
main -ngl 33 --interactive-first --color -e --in-prefix "<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -r "<|eot_id|>" -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
```
Below is a sample output on Intel Arc GPU:
<img src="https://llm-assets.readthedocs.io/en/latest/_images/llama3-cpp-arc-demo.png" width=100%; />
### 2. Run Llama3 using Ollama
#### 2.1 Install IPEX-LLM for Ollama and Initialize
Visit [Run Ollama with IPEX-LLM on Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html), and follow the instructions in section [Install IPEX-LLM for llama.cpp](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#install-ipex-llm-for-llama-cpp) to install the IPEX-LLM with Ollama binary, then follow the instructions in section [Initialize Ollama](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#initialize-ollama) to initialize.
**After the above steps, you should have created a conda environment (named `llm-cpp`, for instance) and have the ollama binary file in your current directory.**
**Now you can use this executable file following standard Ollama usage.**
#### 2.2 Run Llama3 on Intel GPU using Ollama
[ollama/ollama](https://github.com/ollama/ollama) has already added [Llama3](https://ollama.com/library/llama3) to its library, so it's really easy to run Llama 3 using Ollama now.
##### 2.2.1 Run Ollama Serve
Launch the Ollama service:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
export OLLAMA_NUM_GPU=999
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
./ollama serve
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set OLLAMA_NUM_GPU=999
set SYCL_CACHE_PERSISTENT=1
ollama serve
```
```eval_rst
.. tip::
If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
.. code-block:: bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
```eval_rst
.. note::
To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`.
```
##### 2.2.2 Run Llama 3 Using Ollama
Keep the Ollama service running, open another terminal, and run Llama 3 with `ollama run`:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export no_proxy=localhost,127.0.0.1
./ollama run llama3:8b-instruct-q4_K_M
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
set no_proxy=localhost,127.0.0.1
ollama run llama3:8b-instruct-q4_K_M
```
```eval_rst
.. note::
Here we just take `llama3:8b-instruct-q4_K_M` for example, you can replace it with any other Llama3 model you want.
```
Below is a sample output on an Intel Arc GPU:
<img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama-llama3-arc-demo.png" width=100%; />
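Besides the interactive console, you can also query the running Ollama service over its REST API; a sketch assuming the default port `11434` and the model tag pulled above:

```bash
# Send a one-shot generation request to the Ollama service
curl http://localhost:11434/api/generate -d '
{
  "model": "llama3:8b-instruct-q4_K_M",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```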

View file

@@ -0,0 +1,333 @@
# Run llama.cpp with IPEX-LLM on Intel GPU
[ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) provides fast LLM inference in pure C++ across a variety of hardware; you can now use the C++ interface of [`ipex-llm`](https://github.com/intel-analytics/ipex-llm) as an accelerated backend for `llama.cpp` running on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*.
See the demo of running LLaMA2-7B on Intel Arc GPU below.
<video src="https://llm-assets.readthedocs.io/en/latest/_images/llama-cpp-arc.mp4" width="100%" controls></video>
```eval_rst
.. note::
`ipex-llm[cpp]==2.5.0b20240527` is consistent with `c780e75 <https://github.com/ggerganov/llama.cpp/commit/c780e75305dba1f67691a8dc0e8bc8425838a452>`_ of llama.cpp.
Our current version is consistent with `62bfef5 <https://github.com/ggerganov/llama.cpp/commit/62bfef5194d5582486d62da3db59bf44981b7912>`_ of llama.cpp.
```
## Quick Start
This quickstart guide walks you through installing and running `llama.cpp` with `ipex-llm`.
### 0 Prerequisites
IPEX-LLM's support for `llama.cpp` is now available for both Linux and Windows systems.
#### Linux
For Linux system, we recommend Ubuntu 20.04 or later (Ubuntu 22.04 is preferred).
Visit the [Install IPEX-LLM on Linux with Intel GPU](./install_linux_gpu.html), follow [Install Intel GPU Driver](./install_linux_gpu.html#install-intel-gpu-driver) and [Install oneAPI](./install_linux_gpu.html#install-oneapi) to install GPU driver and Intel® oneAPI Base Toolkit 2024.0.
#### Windows (Optional)
The IPEX-LLM backend for llama.cpp only supports recent GPU drivers. Please make sure your GPU driver version is equal to or newer than `31.0.101.5333`; otherwise you might see gibberish output.
If you have a lower GPU driver version, visit the [Install IPEX-LLM on Windows with Intel GPU Guide](./install_windows_gpu.html), and follow [Update GPU driver](./install_windows_gpu.html#optional-update-gpu-driver).
### 1 Install IPEX-LLM for llama.cpp
To use `llama.cpp` with IPEX-LLM, first ensure that `ipex-llm[cpp]` is installed.
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
conda create -n llm-cpp python=3.11
conda activate llm-cpp
pip install --pre --upgrade ipex-llm[cpp]
.. tab:: Windows
.. note::
Please run the following command in Miniforge Prompt.
.. code-block:: cmd
conda create -n llm-cpp python=3.11
conda activate llm-cpp
pip install --pre --upgrade ipex-llm[cpp]
```
**After the installation, you should have created a conda environment, named `llm-cpp` for instance, for running `llama.cpp` commands with IPEX-LLM.**
### 2 Setup for running llama.cpp
First, create a directory for `llama.cpp`; for instance, use the following command to create a `llama-cpp` directory and enter it.
```cmd
mkdir llama-cpp
cd llama-cpp
```
#### Initialize llama.cpp with IPEX-LLM
Then you can use following command to initialize `llama.cpp` with IPEX-LLM:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
init-llama-cpp
After ``init-llama-cpp``, you should see many soft links of ``llama.cpp``'s executable files and a ``convert.py`` in current directory.
.. image:: https://llm-assets.readthedocs.io/en/latest/_images/init_llama_cpp_demo_image.png
.. tab:: Windows
Please run the following command with **administrator privilege in Miniforge Prompt**.
.. code-block:: bash
init-llama-cpp.bat
After ``init-llama-cpp.bat``, you should see many soft links of ``llama.cpp``'s executable files and a ``convert.py`` in current directory.
.. image:: https://llm-assets.readthedocs.io/en/latest/_images/init_llama_cpp_demo_image_windows.png
```
```eval_rst
.. note::
``init-llama-cpp`` will create soft links of llama.cpp's executable files to current directory, if you want to use these executable files in other places, don't forget to run above commands again.
```
```eval_rst
.. note::
If you have installed higher version ``ipex-llm[cpp]`` and want to upgrade your binary file, don't forget to remove old binary files first and initialize again with ``init-llama-cpp`` or ``init-llama-cpp.bat``.
```
**Now you can use these executable files by standard llama.cpp's usage.**
#### Runtime Configuration
To use GPU acceleration, several environment variables are required or recommended before running `llama.cpp`.
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
set SYCL_CACHE_PERSISTENT=1
```
```eval_rst
.. tip::
If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance:
.. code-block:: bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
### 3 Example: Running community GGUF models with IPEX-LLM
Here we provide a simple example to show how to run a community GGUF model with IPEX-LLM.
#### Model Download
Before running, you should download or copy a community GGUF model to your current directory, for instance `mistral-7b-instruct-v0.1.Q4_K_M.gguf` from [Mistral-7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/tree/main).
#### Run the quantized model
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
.. note::
For more details about meaning of each parameter, you can use ``./main -h``.
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
.. note::
For more details about meaning of each parameter, you can use ``main -h``.
```
#### Sample Output
```
Log start
main: build = 1 (38bcbd4)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017) for x86_64-unknown-linux-gnu
main: seed = 1710359960
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 8 SYCL devices:
|ID| Name |compute capability|Max compute units|Max work group|Max sub group|Global mem size|
|--|---------------------------------------------|------------------|-----------------|--------------|-------------|---------------|
| 0| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136|
| 1| Intel(R) FPGA Emulation Device| 1.2| 32| 67108864| 64| 67181625344|
| 2| 13th Gen Intel(R) Core(TM) i9-13900K| 3.0| 32| 8192| 64| 67181625344|
| 3| Intel(R) Arc(TM) A770 Graphics| 3.0| 512| 1024| 32| 16225243136|
| 4| Intel(R) Arc(TM) A770 Graphics| 3.0| 512| 1024| 32| 16225243136|
| 5| Intel(R) UHD Graphics 770| 3.0| 32| 512| 32| 53745299456|
| 6| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136|
| 7| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53745299456|
detect 2 SYCL GPUs: [0,6] with Max compute units:512
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ~/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attm = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size = 0.33 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 2113.28 MiB
llm_load_tensors: SYCL6 buffer size = 1981.77 MiB
llm_load_tensors: SYCL_Host buffer size = 70.31 MiB
...............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 34.00 MiB
llama_kv_cache_init: SYCL6 KV buffer size = 30.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: SYCL_Host input buffer size = 10.01 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 73.00 MiB
llama_new_context_with_model: SYCL6 compute buffer size = 73.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 3
system_info: n_threads = 8 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 1
Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun exploring the world around her. Her parents were kind and let her do what she wanted, as long as she stayed safe.
One day, the little
llama_print_timings: load time = 10096.78 ms
llama_print_timings: sample time = x.xx ms / 32 runs ( xx.xx ms per token, xx.xx tokens per second)
llama_print_timings: prompt eval time = xx.xx ms / 31 tokens ( xx.xx ms per token, xx.xx tokens per second)
llama_print_timings: eval time = xx.xx ms / 31 runs ( xx.xx ms per token, xx.xx tokens per second)
llama_print_timings: total time = xx.xx ms / 62 tokens
Log end
```
### Troubleshooting
#### Fail to quantize model
If you encounter `main: failed to quantize model from xxx`, please make sure you have created the related output directory.
#### Program hangs during model loading
If your program hangs after `llm_load_tensors: SYCL_Host buffer size = xx.xx MiB`, you can add `--no-mmap` to your command.
#### How to set the `-ngl` parameter
`-ngl` means the number of layers to store in VRAM. If your VRAM is sufficient, we recommend putting all layers on the GPU; you can simply set `-ngl` to a large number like 999 to achieve this.
If `-ngl` is set to 0, the entire model will run on the CPU. If `-ngl` is greater than 0 and less than the number of model layers, it is a mixed GPU + CPU scenario.
#### How to specify GPU
If your machine has multiple GPUs, `llama.cpp` will by default use all of them, which may slow down inference for a model that can run on a single GPU. You can add `-sm none` to your command to use only one GPU.
Also, you can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select a device before executing your command; more details can be found [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.html#oneapi-device-selector).
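For example, a sketch that combines these options (the GPU id `0` and the model file are assumptions for your setup):

```bash
# Pin llama.cpp to a single GPU and offload all layers to it
ONEAPI_DEVICE_SELECTOR=level_zero:0 ./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf \
  -n 32 --prompt "Once upon a time" -e -ngl 999 -sm none
```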
#### Program crash with Chinese prompt
If you run the llama.cpp program on Windows and find that your program crashes or outputs abnormally when accepting Chinese prompts, you can open `Region->Administrative->Change System locale..`, check `Beta: Use Unicode UTF-8 for worldwide language support` option and then restart your computer.
For detailed instructions on how to do this, see [this issue](https://github.com/intel-analytics/ipex-llm/issues/10989#issuecomment-2105600469).

View file

@@ -0,0 +1,204 @@
# Run Ollama with IPEX-LLM on Intel GPU
[ollama/ollama](https://github.com/ollama/ollama) is a popular framework designed to build and run language models on a local machine; you can now use the C++ interface of [`ipex-llm`](https://github.com/intel-analytics/ipex-llm) as an accelerated backend for `ollama` running on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*.
See the demo of running LLaMA2-7B on Intel Arc GPU below.
<video src="https://llm-assets.readthedocs.io/en/latest/_images/ollama-linux-arc.mp4" width="100%" controls></video>
```eval_rst
.. note::
`ipex-llm[cpp]==2.5.0b20240527` is consistent with `v0.1.34 <https://github.com/ollama/ollama/releases/tag/v0.1.34>`_ of ollama.
Our current version is consistent with `v0.1.39 <https://github.com/ollama/ollama/releases/tag/v0.1.39>`_ of ollama.
```
## Quickstart
### 1 Install IPEX-LLM for Ollama
IPEX-LLM's support for `ollama` is now available for both Linux and Windows systems.
Visit [Run llama.cpp with IPEX-LLM on Intel GPU Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html), and follow the instructions in section [Prerequisites](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#prerequisites) to setup and section [Install IPEX-LLM cpp](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#install-ipex-llm-for-llama-cpp) to install the IPEX-LLM with Ollama binaries.
**After the installation, you should have created a conda environment, named `llm-cpp` for instance, for running `ollama` commands with IPEX-LLM.**
### 2 Initialize Ollama
Activate the `llm-cpp` conda environment and initialize Ollama by executing the commands below. A symbolic link to `ollama` will appear in your current directory.
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
conda activate llm-cpp
init-ollama
.. tab:: Windows
Please run the following command with **administrator privilege in Miniforge Prompt**.
.. code-block:: bash
conda activate llm-cpp
init-ollama.bat
```
```eval_rst
.. note::
If you have installed higher version ``ipex-llm[cpp]`` and want to upgrade your ollama binary file, don't forget to remove old binary files first and initialize again with ``init-ollama`` or ``init-ollama.bat``.
```
**Now you can use this executable file by standard ollama's usage.**
### 3 Run Ollama Serve
You may launch the Ollama service as below:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
./ollama serve
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
set OLLAMA_NUM_GPU=999
set no_proxy=localhost,127.0.0.1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
ollama serve
```
```eval_rst
.. note::
Please set environment variable ``OLLAMA_NUM_GPU`` to ``999`` to make sure all layers of your model are running on Intel GPU, otherwise, some layers may run on CPU.
```
```eval_rst
.. tip::
If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`:
.. code-block:: bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
```eval_rst
.. note::
To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`.
```
The console will display messages similar to the following:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_serve.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_serve.png" width=100%; />
</a>
### 4 Pull Model
Keep the Ollama service running, open another terminal, and run `./ollama pull <model_name>` on Linux (`ollama.exe pull <model_name>` on Windows) to automatically pull a model, e.g. `dolphin-phi:latest`:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_pull.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_pull.png" width=100%; />
</a>
### 5 Using Ollama
#### Using Curl
Using `curl` is the easiest way to verify the API service and model. Execute the following commands in a terminal. **Replace the <model_name> with your pulled
model**, e.g. `dolphin-phi`.
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
curl http://localhost:11434/api/generate -d '
{
"model": "<model_name>",
"prompt": "Why is the sky blue?",
"stream": false
}'
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
curl http://localhost:11434/api/generate -d "
{
\"model\": \"<model_name>\",
\"prompt\": \"Why is the sky blue?\",
\"stream\": false
}"
```
#### Using Ollama Run GGUF models
Ollama supports importing GGUF models via a Modelfile. For example, suppose you have downloaded `mistral-7b-instruct-v0.1.Q4_K_M.gguf` from [Mistral-7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/tree/main); you can then create a file named `Modelfile`:
```bash
FROM ./mistral-7b-instruct-v0.1.Q4_K_M.gguf
TEMPLATE [INST] {{ .Prompt }} [/INST]
PARAMETER num_predict 64
```
Then you can create the model in Ollama with `ollama create example -f Modelfile` and use `ollama run` to run the model directly in the console.
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export no_proxy=localhost,127.0.0.1
./ollama create example -f Modelfile
./ollama run example
.. tab:: Windows
Please run the following command in Miniforge Prompt.
.. code-block:: bash
set no_proxy=localhost,127.0.0.1
ollama create example -f Modelfile
ollama run example
```
An example of interacting with the model via `ollama run example` looks like the following:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/ollama_gguf_demo_image.png" width=100%; />
</a>
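In addition to the interactive session shown above, `ollama run` also accepts a one-shot prompt as a command-line argument; a small sketch on Linux:

```bash
# Prints a single completion for the given prompt and then exits
./ollama run example "What is the capital of France?"
```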
# Run Open WebUI with Intel GPU
[Open WebUI](https://github.com/open-webui/open-webui) is a user-friendly GUI for running LLMs locally; by porting it to [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily run LLMs in [Open WebUI](https://github.com/open-webui/open-webui) on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*.
*See the demo of running Mistral:7B on Intel Arc A770 below.*
<video src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_demo.mp4" width="100%" controls></video>
## Quickstart
This quickstart guide walks you through setting up and using [Open WebUI](https://github.com/open-webui/open-webui) with Ollama (using the C++ interface of [`ipex-llm`](https://github.com/intel-analytics/ipex-llm) as an accelerated backend).
### 1. Run Ollama with Intel GPU
Follow the instructions in the [Run Ollama with Intel GPU](ollama_quickstart.html) guide to install and run the Ollama service. Please ensure that the Ollama server continues to run while you're using Open WebUI.
### 2. Install Open WebUI
#### Install Node.js & npm
```eval_rst
.. note::
Package version requirements for running Open WebUI: Node.js (>= 20.10) or Bun (>= 1.0.21), Python (>= 3.11)
```
Please install Node.js & npm as below:
```eval_rst
.. tabs::
.. tab:: Linux
Run the commands below to install Node.js & npm. Once the installation is complete, verify it by running ``node -v`` and ``npm -v`` to check the versions of Node.js and npm, respectively.
.. code-block:: bash
sudo apt update
sudo apt install nodejs
sudo apt install npm
.. tab:: Windows
You may download Node.js installation package from https://nodejs.org/dist/v20.12.2/node-v20.12.2-x64.msi, which will install both Node.js & npm on your system.
Once the installation is complete, verify the installation by running ``node -v`` and ``npm -v`` to check the versions of Node.js and npm, respectively.
```
#### Download Open WebUI
Use `git` to clone the [open-webui repo](https://github.com/open-webui/open-webui.git), or download the open-webui source code zip from [this link](https://github.com/open-webui/open-webui/archive/refs/heads/main.zip) and unzip it to a directory, e.g. `~/open-webui`.
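For instance, on Linux you might clone it as follows (the target directory `~/open-webui` is just an example):

```bash
git clone https://github.com/open-webui/open-webui.git ~/open-webui
```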
#### Install Dependencies
Run the commands below to install the Open WebUI dependencies:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
cd ~/open-webui/
cp -RPp .env.example .env # Copy required .env file
# Build frontend
npm i
npm run build
# Install Dependencies
cd ./backend
pip install -r requirements.txt -U
.. tab:: Windows
.. code-block:: bash
cd ~\open-webui\
copy .env.example .env
# Build frontend
npm install
npm run build
# Install Dependencies
cd .\backend
pip install -r requirements.txt -U
```
### 3. Start Open WebUI
#### Start the service
Run the commands below to start the service:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export no_proxy=localhost,127.0.0.1
bash start.sh
.. note::
If you have difficulty accessing the Hugging Face repositories, you may use a mirror, e.g. add ``export HF_ENDPOINT=https://hf-mirror.com`` before running ``bash start.sh``.
.. tab:: Windows
.. code-block:: bash
set no_proxy=localhost,127.0.0.1
start_windows.bat
.. note::
If you have difficulty accessing the Hugging Face repositories, you may use a mirror, e.g. add ``set HF_ENDPOINT=https://hf-mirror.com`` before running ``start_windows.bat``.
```
#### Access the WebUI
Upon successful launch, URLs to access the WebUI will be displayed in the terminal. Open the provided local URL in your browser to interact with the WebUI, e.g. http://localhost:8080/.
### 4. Using Open WebUI
```eval_rst
.. note::
For detailed information about how to use Open WebUI, visit the README of `open-webui official repository <https://github.com/open-webui/open-webui>`_.
```
#### Log-in
If this is your first time using it, you need to register. After registering, log in with the registered account to access the interface.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_signup.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_signup.png" width="100%" />
</a>
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_login.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_login.png" width="100%" />
</a>
#### Configure `Ollama` service URL
Access the Ollama settings through **Settings -> Connections** in the menu. By default, the **Ollama Base URL** is preset to http://localhost:11434, as illustrated in the snapshot below. To verify the status of the Ollama service connection, click the **Refresh** button located next to the textbox. If the WebUI is unable to establish a connection with the Ollama server, you will see an error message stating `WebUI could not connect to Ollama`.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_settings_0.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_settings_0.png" width="100%" />
</a>
If the connection is successful, you will see a message stating `Service Connection Verified`, as illustrated below.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_settings.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_settings.png" width="100%" />
</a>
```eval_rst
.. note::
If you want to use an Ollama server hosted at a different URL, simply update the **Ollama Base URL** to the new URL and press the **Refresh** button to re-confirm the connection to Ollama.
```
#### Pull Model
Go to **Settings -> Models** in the menu, choose a model under **Pull a model from Ollama.com** using the drop-down menu, and then hit the **Download** button on the right. Ollama will automatically download the selected model for you.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_pull_models.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_pull_models.png" width="100%" />
</a>
#### Chat with the Model
Start new conversations with **New chat** in the left-side menu.
On the right side, choose a downloaded model from the **Select a model** drop-down menu at the top, input your questions into the **Send a Message** textbox at the bottom, and click the button on the right to get responses.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_select_model.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_select_model.png" width="100%" />
</a>
<br/>
Additionally, you can drag and drop a document into the textbox, allowing the LLM to access its contents. The LLM will then generate answers based on the document provided.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_chat_2.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/open_webui_chat_2.png" width="100%" />
</a>
#### Exit Open WebUI
To shut down the Open WebUI server, use **Ctrl+C** in the terminal where the Open WebUI server is running, then close your browser tab.
### 5. Troubleshooting
#### Error `No module named 'torch._C'`
If you encounter the error `ModuleNotFoundError: No module named 'torch._C'` after executing `bash start.sh`, you can resolve it by reinstalling PyTorch. First, run `pip uninstall torch` to remove the existing PyTorch installation, and then reinstall it along with its dependencies by running `pip install torch torchvision torchaudio`.
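A minimal sketch of that fix (run it inside the same Python environment used to start Open WebUI):

```bash
# Remove the broken PyTorch installation, then reinstall it with its companion packages
pip uninstall -y torch
pip install torch torchvision torchaudio
```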
# Run PrivateGPT with IPEX-LLM on Intel GPU
[PrivateGPT](https://github.com/zylon-ai/private-gpt) is a production-ready AI project that allows users to chat over documents, etc.; by integrating it with [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily leverage local LLMs running on Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max).
*See the demo of privateGPT running Mistral:7B on Intel Arc A770 below.*
<video src="https://llm-assets.readthedocs.io/en/latest/_images/PrivateGPT-ARC.mp4" width="100%" controls></video>
## Quickstart
### 1. Install and Start `Ollama` Service on Intel GPU
Follow the steps in the [Run Ollama on Intel GPU Guide](./ollama_quickstart.md) to install and run Ollama on Intel GPU. Ensure that `ollama serve` is running correctly and can be accessed through a local URL (e.g., `http://127.0.0.1:11434`) or a remote URL (e.g., `http://your_ip:11434`).
We recommend pulling the desired model before proceeding with PrivateGPT. For instance, to pull the Mistral:7B model, you can use the following command:
```bash
ollama pull mistral:7b
```
### 2. Install PrivateGPT
#### Download PrivateGPT
You can either clone the repository or download the source zip from [github](https://github.com/zylon-ai/private-gpt/archive/refs/heads/main.zip):
```bash
git clone https://github.com/zylon-ai/private-gpt
```
#### Install Dependencies
Execute the following commands in a terminal to install the dependencies of PrivateGPT:
```cmd
cd private-gpt
pip install poetry
pip install ffmpy==0.3.1
poetry install --extras "ui llms-ollama embeddings-ollama vector-stores-qdrant"
```
For more details, refer to the [PrivateGPT installation Guide](https://docs.privategpt.dev/installation/getting-started/main-concepts).
### 3. Start PrivateGPT
#### Configure PrivateGPT
To configure PrivateGPT to use Ollama for running local LLMs, you should edit the `private-gpt/settings-ollama.yaml` file. Modify the `ollama` section by setting the `llm_model` and `embedding_model` you wish to use, and updating the `api_base` and `embedding_api_base` to direct to your Ollama URL.
Below is an example of how `settings-ollama.yaml` should look.
<p align="center"><a href="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-ollama-setting.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-ollama-setting.png" alt="image-p1" width=100%; />
</a></p>
```eval_rst
.. note::
``settings-ollama.yaml`` is loaded when the Ollama profile is specified in the ``PGPT_PROFILES`` environment variable. This can override configurations from the default ``settings.yaml``.
```
For more information on configuring PrivateGPT, please visit the [PrivateGPT Main Concepts](https://docs.privategpt.dev/installation/getting-started/main-concepts) page.
#### Start the service
Please ensure that the Ollama server continues to run in a terminal while you're using PrivateGPT.
Run the commands below to start the service in another terminal:
```eval_rst
.. tabs::
.. tab:: Linux
.. code-block:: bash
export no_proxy=localhost,127.0.0.1
PGPT_PROFILES=ollama make run
.. note::
Setting ``PGPT_PROFILES=ollama`` will load the configuration from ``settings.yaml`` and ``settings-ollama.yaml``.
.. tab:: Windows
.. code-block:: bash
set no_proxy=localhost,127.0.0.1
set PGPT_PROFILES=ollama
make run
.. note::
Setting ``PGPT_PROFILES=ollama`` will load the configuration from ``settings.yaml`` and ``settings-ollama.yaml``.
```
Upon successful deployment, you will see logs in the terminal similar to the following:
<p align="center"><a href="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-service-success.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-service-success.png" alt="image-p1" width=100%; />
</a></p>
Open a browser (if it doesn't open automatically) and navigate to the URL displayed in the terminal. If it shows http://0.0.0.0:8001, you can access it locally via `http://127.0.0.1:8001` or remotely via `http://your_ip:8001`.
### 4. Using PrivateGPT
#### Chat with the Model
To chat with the LLM, select the "LLM Chat" option located in the upper left corner of the page. Type your messages at the bottom of the page and click the "Submit" button to receive responses from the model.
<p align="center"><a href="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-LLM-Chat.png" target="_blank" align="center">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-LLM-Chat.png" alt="image-p1" width=100%; />
</a></p>
#### Chat over Documents (RAG)
To interact with documents, select the "Query Files" option in the upper left corner of the page. Click the "Upload File(s)" button to upload documents. After the documents have been vectorized, you can type your messages at the bottom of the page and click the "Submit" button to receive responses from the model based on the uploaded content.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-Query-Files.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/privateGPT-Query-Files.png" width=100%; />
</a>
# Serving using IPEX-LLM and vLLM on Intel GPU
vLLM is a fast and easy-to-use library for LLM inference and serving. You can find the detailed information at their [homepage](https://github.com/vllm-project/vllm).
IPEX-LLM can be integrated into vLLM so that users can use `IPEX-LLM` to boost the performance of the vLLM engine on Intel **GPUs** *(e.g., local PC with discrete GPU such as Arc, Flex and Max)*.
Currently, IPEX-LLM-integrated vLLM only supports the following models:
- Qwen series models
- Llama series models
- ChatGLM series models
- Baichuan series models
## Quick Start
This quickstart guide walks you through installing and running `vLLM` with `ipex-llm`.
### 1. Install IPEX-LLM for vLLM
IPEX-LLM's support for `vLLM` is currently only available on Linux.
Visit [Install IPEX-LLM on Linux with Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html) and follow the instructions in section [Install Prerequisites](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-prerequisites) to install the prerequisites needed for running code on Intel GPUs.
Then, follow the instructions in section [Install ipex-llm](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-ipex-llm) to install `ipex-llm[xpu]` and set up the recommended runtime configurations.
**After the installation, you should have created a conda environment, named `ipex-vllm` for instance, for running `vLLM` commands with IPEX-LLM.**
### 2. Install vLLM
Currently, we maintain a specific branch of vLLM, which only works on Intel GPUs.
Activate the `ipex-vllm` conda environment and install vLLM by executing the commands below.
```bash
conda activate ipex-vllm
source /opt/intel/oneapi/setvars.sh
git clone -b sycl_xpu https://github.com/analytics-zoo/vllm.git
cd vllm
pip install -r requirements-xpu.txt
pip install --no-deps xformers
VLLM_BUILD_XPU_OPS=1 pip install --no-build-isolation -v -e .
pip install outlines==0.0.34 --no-deps
pip install interegular cloudpickle diskcache joblib lark nest-asyncio numba scipy
# For Qwen model support
pip install transformers_stream_generator einops tiktoken
```
**Now you are all set to use vLLM with IPEX-LLM**
### 3. Offline Inference/Service
#### Offline Inference
To run offline inference using vLLM for a quick impression, use the following example.
```eval_rst
.. note::
Please modify ``MODEL_PATH`` in ``offline_inference.py`` to use your chosen model.
You can try modifying ``load_in_low_bit`` to different values in **[sym_int4, fp6, fp8, fp8_e4m3, fp16]** to use different quantization dtypes.
```
```bash
#!/bin/bash
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/python/llm/example/GPU/vLLM-Serving/offline_inference.py
python offline_inference.py
```
For instructions on how to change the `load_in_low_bit` value in `offline_inference.py`, check the following example:
```python
llm = LLM(model="YOUR_MODEL",
device="xpu",
dtype="float16",
enforce_eager=True,
# Simply change here for the desired load_in_low_bit value
load_in_low_bit="sym_int4",
tensor_parallel_size=1,
trust_remote_code=True)
```
The result of executing `Baichuan2-7B-Chat` model with `sym_int4` low-bit format is shown as follows:
```
Prompt: 'Hello, my name is', Generated text: ' [Your Name] and I am a [Your Job Title] at [Your'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government in the United States. The president leads'
Prompt: 'The capital of France is', Generated text: ' Paris.\nThe capital of France is Paris.'
Prompt: 'The future of AI is', Generated text: " bright, but it's not without challenges. As AI continues to evolve,"
```
#### Service
```eval_rst
.. note::
Because kernels are JIT-compiled, we recommend sending a few warm-up requests before using the service to get the best performance.
```
To fully utilize the continuous batching feature of `vLLM`, you can send requests to the service using `curl` or similar methods. The requests sent to the engine will be batched at the token level. Queries will be executed in the same `forward` step of the LLM and removed when they finish, instead of waiting for all sequences to finish.
For vLLM, you can start the service using the following command:
```bash
#!/bin/bash
model="YOUR_MODEL_PATH"
served_model_name="YOUR_MODEL_NAME"
# You may need to adjust the value of
# --max-model-len, --max-num-batched-tokens, --max-num-seqs
# to acquire the best performance
# Change value --load-in-low-bit to [fp6, fp8, fp8_e4m3, fp16] to use different low-bit formats
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.75 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 4096 \
--max-num-batched-tokens 10240 \
--max-num-seqs 12 \
--tensor-parallel-size 1
```
You can tune the service using these four arguments:
1. `--gpu-memory-utilization`: The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.
2. `--max-model-len`: Model context length. If unspecified, will be automatically derived from the model config.
3. `--max-num-batched-tokens`: Maximum number of batched tokens per iteration.
4. `--max-num-seqs`: Maximum number of sequences per iteration. Default: 256
For longer input prompts, we suggest using `--max-num-batched-tokens` to restrict the service. The reason is that peak GPU memory usage occurs when the first token is generated; restricting `--max-num-batched-tokens` limits the input size during first-token generation.
`--max-num-seqs` restricts generation for both the first token and subsequent tokens. It limits the maximum batch size to the value set by `--max-num-seqs`.
When an out-of-memory error occurs, the most obvious solution is to reduce `--gpu-memory-utilization`. Other ways to resolve this error are to lower `--max-num-batched-tokens` if peak memory occurs while generating the first token, or to lower `--max-num-seqs` if peak memory occurs while generating subsequent tokens.
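As a concrete illustration (the numbers below are placeholders rather than tuned recommendations), a more conservative launch for long prompts on a memory-constrained card might adjust only the tuning-related flags relative to the script above:

```bash
# Same entrypoint as above; only the memory/tuning flags are changed
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --device xpu --dtype float16 --enforce-eager \
  --load-in-low-bit sym_int4 \
  --gpu-memory-utilization 0.6 \
  --max-model-len 2048 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 8
```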
If the service has been booted successfully, the console will display messages similar to the following:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" width=100%; />
</a>
After the service has been booted successfully, you can send a test request using `curl`. Here, `YOUR_MODEL` should be set equal to `$served_model_name` in your booting script, e.g. `Qwen1.5`.
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "YOUR_MODEL",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}' | jq '.choices[0].text'
```
Below is an example output using `Qwen1.5-7B-Chat` with the low-bit format `sym_int4`:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/vllm-curl-result.png" width=100%; />
</a>
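You can also list the models the server has registered through the OpenAI-compatible `/v1/models` endpoint; a quick sanity check (assuming the same port `8000` as above):

```bash
# The returned "id" should match the name passed via --served-model-name
curl http://localhost:8000/v1/models | jq '.data[].id'
```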
```eval_rst
.. tip::
If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionally set the following environment variable for optimal performance before starting the service:
.. code-block:: bash
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
### 4. About Tensor Parallel
> Note: We recommend using Docker for tensor parallel deployment. Check our serving Docker image `intelanalytics/ipex-llm-serving-xpu`.
We also support tensor parallel across multiple Intel GPU cards. To enable tensor parallel, you will need to install `libfabric-dev` in your environment. On Ubuntu, you can install it with:
```bash
sudo apt-get install libfabric-dev
```
To deploy your model across multiple cards, simply change the value of `--tensor-parallel-size` to the desired value.
For instance, if you have two Arc A770 cards in your environment, you can set this value to 2. Some oneCCL environment variable settings are also needed; check the following example:
```bash
#!/bin/bash
model="YOUR_MODEL_PATH"
served_model_name="YOUR_MODEL_NAME"
# CCL needed environment variables
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
# You may need to adjust the value of
# --max-model-len, --max-num-batched-tokens, --max-num-seqs
# to acquire the best performance
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.75 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 4096 \
--max-num-batched-tokens 10240 \
--max-num-seqs 12 \
--tensor-parallel-size 2
```
If the service has booted successfully, you should see output similar to the following figure:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/start-vllm-service.png" width=100%; />
</a>
### 5. Performing Benchmark
To perform a benchmark, you can use the **benchmark_throughput** script originally provided by the vLLM repo.
```bash
conda activate ipex-vllm
source /opt/intel/oneapi/setvars.sh
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
wget https://raw.githubusercontent.com/intel-analytics/ipex-llm/main/docker/llm/serving/xpu/docker/benchmark_vllm_throughput.py -O benchmark_throughput.py
export MODEL="YOUR_MODEL"
# You can change load-in-low-bit from values in [sym_int4, fp6, fp8, fp8_e4m3, fp16]
python3 ./benchmark_throughput.py \
--backend vllm \
--dataset ./ShareGPT_V3_unfiltered_cleaned_split.json \
--model $MODEL \
--num-prompts 1000 \
--seed 42 \
--trust-remote-code \
--enforce-eager \
--dtype float16 \
--device xpu \
--load-in-low-bit sym_int4 \
--gpu-memory-utilization 0.85
```
The following figure shows the result of benchmarking `Llama-2-7b-chat-hf` using 50 prompts:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/vllm-benchmark-result.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/vllm-benchmark-result.png" width=100%; />
</a>
```eval_rst
.. tip::
To find the best config that fits your workload, you may need to start the service and use tools like ``wrk`` or ``jmeter`` to perform stress tests.
```
# Run Text Generation WebUI on Intel GPU
The [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) provides a user-friendly GUI for anyone to run LLMs locally; by porting it to [`ipex-llm`](https://github.com/intel-analytics/ipex-llm), users can now easily run LLMs in [Text Generation WebUI](https://github.com/intel-analytics/text-generation-webui) on Intel **GPU** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)*.
See the demo of running LLaMA2-7B on an Intel Core Ultra laptop below.
<video src="https://llm-assets.readthedocs.io/en/latest/_images/webui-mtl.mp4" width="100%" controls></video>
## Quickstart
This quickstart guide walks you through setting up and using the [Text Generation WebUI](https://github.com/intel-analytics/text-generation-webui) with `ipex-llm`.
A preview of the WebUI in action is shown below:
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat.png" width=80%; />
</a>
### 1 Install IPEX-LLM
To use the WebUI, first ensure that IPEX-LLM is installed. Follow the instructions on the [IPEX-LLM Installation Quickstart for Windows with Intel GPU](install_windows_gpu.html).
**After the installation, you should have created a conda environment, named `llm` for instance, for running `ipex-llm` applications.**
### 2 Install the WebUI
#### Download the WebUI
Download the `text-generation-webui` with IPEX-LLM integrations from [this link](https://github.com/intel-analytics/text-generation-webui/archive/refs/heads/ipex-llm.zip). Unzip the content into a directory, e.g., `C:\text-generation-webui`.
#### Install Dependencies
Open **Miniforge Prompt** and activate the conda environment you have created in [section 1](#1-install-ipex-llm), e.g., `llm`.
```
conda activate llm
```
Then, change to the WebUI directory (e.g., `C:\text-generation-webui`) and install the necessary dependencies:
```cmd
cd C:\text-generation-webui
pip install -r requirements_cpu_only.txt
pip install -r extensions/openai/requirements.txt
```
```eval_rst
.. note::
``extensions/openai/requirements.txt`` is for the API service. If you don't need the API service, you can omit this command.
```
### 3 Start the WebUI Server
#### Set Environment Variables
Configure oneAPI variables by running the following command in **Miniforge Prompt**:
```eval_rst
.. note::
For more details about runtime configurations, refer to `this guide <https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration>`_
```
```cmd
set SYCL_CACHE_PERSISTENT=1
```
If you're running on iGPU, set additional environment variables by running the following commands:
```cmd
set BIGDL_LLM_XMX_DISABLED=1
```
#### Launch the Server
In **Miniforge Prompt** with the conda environment `llm` activated, navigate to the `text-generation-webui` folder and execute the following commands (you can optionally launch the server with or without the API service):
##### without API service
```cmd
python server.py --load-in-4bit
```
##### with API service
```cmd
python server.py --load-in-4bit --api --api-port 5000 --listen
```
```eval_rst
.. note::
With the ``--load-in-4bit`` option, the models will be optimized and run at 4-bit precision. For configurations of other formats and precisions, refer to `this link <https://github.com/intel-analytics/text-generation-webui?tab=readme-ov-file#32-optimizations-for-other-percisions>`_
```
```eval_rst
.. note::
The API service allows users to access models using an OpenAI-compatible API. For usage examples, refer to `this link <https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples>`_
```
```eval_rst
.. note::
The API server will by default use port ``5000``. To change the port, use ``--api-port 1234`` in the command above. You can also specify using SSL or API Key in the command. Please see `this guide <https://github.com/intel-analytics/text-generation-webui/blob/ipex-llm/docs/12%20-%20OpenAI%20API.md>`_ for the full list of arguments.
```
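Once the server is up, you can send an OpenAI-style request as a quick sanity check of the API service. The sketch below assumes the server was launched with `--api` on the default port `5000`; adjust the port if you changed it:

```cmd
curl http://127.0.0.1:5000/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\": [{\"role\": \"user\", \"content\": \"Why is the sky blue?\"}], \"max_tokens\": 64}"
```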
#### Access the WebUI
Upon successful launch, URLs to access the WebUI will be displayed in the terminal as shown below. Open the provided local URL in your browser to interact with the WebUI.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_launch_server.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_launch_server.png" width=100%; />
</a>
### 4. Using the WebUI
#### Model Download
Place Hugging Face models in `C:\text-generation-webui\models` by either copying them locally or downloading them via the WebUI. To download, navigate to the **Model** tab, enter the model's Hugging Face ID (for instance, `microsoft/phi-1_5`) in the **Download model or LoRA** section, and click **Download**, as illustrated below.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_download_model.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_download_model.png" width=100%; />
</a>
After copying or downloading the models, click on the blue **refresh** button to update the **Model** drop-down menu. Then, choose your desired model from the newly updated list.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_select_model.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_select_model.png" width=100%; />
</a>
#### Load Model
Default settings are recommended for most users. Click **Load** to activate the model. Address any errors by installing missing packages as prompted, and ensure compatibility with your version of the transformers package. Refer to the [troubleshooting section](#troubleshooting) for more details.
If everything goes well, you will get a message as shown below.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_load_model_success.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_load_model_success.png" width=100%; />
</a>
```eval_rst
.. note::
Model loading might take a few minutes as it includes a **warm-up** phase. This `warm-up` step is used to improve the speed of subsequent model uses.
```
#### Chat with the Model
In the **Chat** tab, start new conversations with **New chat**.
Enter prompts into the textbox at the bottom and press the **Generate** button to receive responses.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat.png" width=100%; />
</a>
<!-- Notes:
* Multi-turn conversations may consume GPU memory. You may specify the `Truncate the prompt up to this length` value in `Parameters` tab to reduce the GPU memory usage.
* You may switch to a single-turn conversation mode by turning off `Activate text streaming` in the Parameters tab.
* Please see [Chat-Tab Wiki](https://github.com/oobabooga/text-generation-webui/wiki/01-%E2%80%90-Chat-Tab) for more details. -->
#### Exit the WebUI
To shut down the WebUI server, use **Ctrl+C** in the **Miniforge Prompt** terminal where the WebUI server is running, then close your browser tab.
### 5. Advanced Usage
#### Using Instruct mode
Instruction-following models are models that are fine-tuned with specific prompt formats.
For these models, you should ideally use the `instruct` chat mode.
Under this mode, the model receives user prompts that are formatted according to prompt formats it was trained with.
To use `instruct` chat mode, select the `Chat` tab, scroll down the page, and then select `instruct` under `Mode`.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat_mode_instruct.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_chat_mode_instruct.png" width=100%; />
</a>
When a model is loaded, its corresponding instruction template, which contains prompt formatting, is automatically loaded.
If chat responses are poor, the loaded instruction template might be incorrect.
In this case, go to `Parameters` tab and then `Instruction template` tab.
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_instruction_template.png" target="_blank">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_instruction_template.png" width=100%; />
</a>
You can verify and edit the loaded instruction template in the `Instruction template` field.
You can also manually select an instruction template from `Saved instruction templates` and click `load` to load it into `Instruction template`.
You can add custom template files to this list in `/instruction-templates/` [folder](https://github.com/intel-analytics/text-generation-webui/tree/ipex-llm/instruction-templates).
<!-- For instance, the automatically loaded instruction template for `chatGLM3` model is incorrect, and you should manually select the `chatGLM3` instruction template. -->
#### Tested models
We have tested the following models with `ipex-llm` using Text Generation WebUI.
| Model | Notes |
|-------|-------|
| llama-2-7b-chat-hf | |
| chatglm3-6b | Manually load ChatGLM3 template for Instruct chat mode |
| Mistral-7B-v0.1 | |
| qwen-7B-Chat | |
### Troubleshooting
#### Potentially slower first response
The first response to user prompt might be slower than expected, with delays of up to several minutes before the response is generated. This delay occurs because the GPU kernels require compilation and initialization, which varies across different GPU types.
#### Missing Required Dependencies
During model loading, you may encounter an **ImportError** like `ImportError: This modeling file requires the following packages that were not found in your environment`. This indicates certain packages required by the model are absent from your environment. Detailed instructions for installing these necessary packages can be found at the bottom of the error messages. Take the following steps to fix these errors:
- Exit the WebUI Server by pressing **Ctrl+C** in the **Miniforge Prompt** terminal.
- Install the missing pip packages as specified in the error message
- Restart the WebUI Server.
If there are still errors on missing packages, repeat the installation process for any additional required packages.
#### Compatibility issues
If you encounter **AttributeError** errors like `AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'`, it may be due to some models being incompatible with the current version of the transformers package because the models are outdated. In such instances, using a more recent model is recommended.
<!--
<a href="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_load_model_error.png">
<img src="https://llm-assets.readthedocs.io/en/latest/_images/webui_quickstart_load_model_error.png" width=100%; />
</a> -->