Add CPU and GPU examples for Yuan2-2B-hf (#9946)

* Add a new CPU example of Yuan2-2B-hf
* Add a new CPU generate.py of Yuan2-2B-hf example
* Add a new GPU example of Yuan2-2B-hf
* Add Yuan2 to README table
* In CPU example: 1. Use English as default prompt; 2. Provide modified files in yuan2-2B-instruct
* In GPU example: 1. Use English as default prompt; 2. Provide modified files
* GPU example: update README
* Update Yuan2-2B-hf in README table
* Add CPU example for Yuan2-2B in PyTorch-Models
* Add GPU example for Yuan2-2B in PyTorch-Models
* Add license in generate.py; modify README
* In GPU, add license in generate.py; modify README
* In CPU yuan2, modify README
* In GPU yuan2, modify README
* In CPU yuan2, modify README
* In GPU example, update the README for Windows GPU support
* In GPU torch example, update the README for Windows GPU support
* GPU hf example README modified
* GPU example README modified
2024-02-23 14:09:30 +08:00 · a2c1675546
commit a2c1675546
parent f1f4094a09
18 changed files with 5435 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -191,6 +191,7 @@ Over 40 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLa
 | SpeechT5 |  | [link](python/llm/example/GPU/PyTorch-Models/Model/speech-t5) |
 | Ziya-Coding-34B-v1.0 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/ziya) | |
 | Phi-2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2) |
 | Yuan2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2) |
 ***For more details, please refer to the `bigdl-llm` [Document](https://test-bigdl-llm.readthedocs.io/en/main/doc/LLM/index.html), [Readme](python/llm), [Tutorial](https://github.com/intel-analytics/bigdl-llm-tutorial) and [API Doc](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/index.html).***
--- a/python/llm/README.md
+++ b/python/llm/README.md
@ -83,6 +83,7 @@ Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLa
 | SpeechT5 |  | [link](example/GPU/PyTorch-Models/Model/speech-t5) |
 | Ziya-Coding-34B-v1.0 | [link](example/CPU/HF-Transformers-AutoModels/Model/ziya) | |
 | Phi-2 | [link](example/CPU/HF-Transformers-AutoModels/Model/phi-2) | [link](example/GPU/HF-Transformers-AutoModels/Model/phi-2) |
 | Yuan2 | [link](example/CPU/HF-Transformers-AutoModels/Model/yuan2) | [link](example/GPU/HF-Transformers-AutoModels/Model/yuan2) |
 ### Working with `bigdl-llm`
--- a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2/README.md
+++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2/README.md
@ -0,0 +1,65 @@
 # Yuan2
 In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on Yuan2 models. For illustration purposes, we utilize the [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf) as a reference Yuan2 model.
 ## 0. Requirements
 To run these examples with BigDL-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
 In addition, you need to modify some files in the Yuan2-2B-hf folder, because the Flash Attention dependency targets CUDA and currently cannot be installed on Intel CPUs. To turn it off manually, please refer to [this issue](https://github.com/IEIT-Yuan/Yuan-2.0/issues/92). We also provide two modified files ([config.json](yuan2-2B-instruct/config.json) and [yuan_hf_model.py](yuan2-2B-instruct/yuan_hf_model.py)) that can replace the original config.json and yuan_hf_model.py.
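
The configuration half of that change can be sketched in Python. This is a minimal sketch, assuming the model has already been downloaded to a local folder. `disable_flash_attention` is a hypothetical helper name, and it touches only the `use_flash_attention` key that appears in the provided config.json. The matching change to yuan_hf_model.py still has to be made as described in the linked issue, or by using the provided file.

```python
import json
from pathlib import Path


def disable_flash_attention(config_path):
    """Set "use_flash_attention" to false in a local Yuan2 config.json.

    Hypothetical helper: it edits only the config file and returns the
    updated dictionary; yuan_hf_model.py is not modified here.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["use_flash_attention"] = False
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

A call would look like `disable_flash_attention("path/to/Yuan2-2B-hf/config.json")`, where the path is a placeholder for your local checkpoint folder.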
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a Yuan2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations.
 ### 1. Install
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 After installing conda, create a Python environment for BigDL-LLM:
 ```bash
 conda create -n llm python=3.9
 conda activate llm
 pip install --pre --upgrade bigdl-llm[all] # install the latest bigdl-llm nightly build with 'all' option
 pip install einops # additional package required for Yuan2 to conduct generation
 pip install pandas # additional package required for Yuan2 to conduct generation
 ```
 ### 2. Run
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
 Arguments info:
 - `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the Yuan2 model to download, or the path to a Hugging Face checkpoint folder. Defaults to `'IEITYuan/Yuan2-2B-hf'`.
 - `--prompt PROMPT`: the prompt to run inference on (with integrated prompt format for chat). Defaults to `'What is AI?'`.
 - `--n-predict N_PREDICT`: the max number of tokens to predict. Defaults to `100`.
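
How these flags map to values inside the script can be sketched with `argparse` alone. The option names and defaults below are copied from generate.py; `build_parser` is a hypothetical name used only for this illustration, and no model is loaded.

```python
import argparse


def build_parser():
    """Rebuild the three options of generate.py with the same defaults."""
    parser = argparse.ArgumentParser(description="Generate text using Yuan2-2B model")
    parser.add_argument("--repo-id-or-model-path", type=str, default="IEITYuan/Yuan2-2B-hf")
    parser.add_argument("--prompt", type=str, default="What is AI?")
    parser.add_argument("--n-predict", type=int, default=100)
    return parser


# No flags given: every option falls back to its default.
defaults = build_parser().parse_args([])
# argparse turns dashes into underscores in the attribute names.
custom = build_parser().parse_args(["--prompt", "Hello", "--n-predict", "32"])
print(defaults.repo_id_or_model_path, defaults.prompt, defaults.n_predict)
print(custom.prompt, custom.n_predict)
```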
 > **Note**: When loading the model in 4-bit, BigDL-LLM converts linear layers in the model into INT4 format. In theory, an *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB of memory for further inference.
 >
 > Please select the appropriate size of the Yuan2 model based on the capabilities of your machine.
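
The rule of thumb in the note can be turned into a quick estimate. This is a rough sketch of the arithmetic only: `estimate_memory_gb` is a hypothetical helper, and real usage will be higher because activations, the KV cache and any layers kept in higher precision are ignored.

```python
def estimate_memory_gb(params_in_billions):
    """Return (loading_gb, inference_gb) using the rule of thumb from the note.

    About 2 bytes per parameter for the 16-bit checkpoint, and about
    0.5 bytes per parameter once the linear layers are INT4.
    """
    loading_gb = 2.0 * params_in_billions
    inference_gb = 0.5 * params_in_billions
    return loading_gb, inference_gb


loading, inference = estimate_memory_gb(2)  # Yuan2-2B
print(f"load: ~{loading:.0f} GB, inference: ~{inference:.0f} GB")
```

For the 2B reference model this gives roughly 4 GB to load and 1 GB for inference.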
 #### 2.1 Client
 On a client Windows machine, it is recommended to run directly with full utilization of all cores:
 ```powershell
 python ./generate.py
 ```
 #### 2.2 Server
 For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and to run the example with all the physical cores of a single socket.
 E.g. on Linux,
 ```bash
 # set BigDL-LLM env variables
 source bigdl-llm-init
 # e.g. for a server with 48 cores per socket
 export OMP_NUM_THREADS=48
 numactl -C 0-47 -m 0 python ./generate.py
 ```
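
For a different core count, the same pinning pattern can be generated rather than typed by hand. This is a sketch under stated assumptions: `single_socket_command` is a hypothetical helper, it assumes core IDs are numbered contiguously per socket and that NUMA node N holds the memory of socket N, so check `lscpu` or `numactl -H` on your machine first.

```python
def single_socket_command(cores_per_socket, socket=0, script="./generate.py"):
    """Build OMP_NUM_THREADS and a numactl command pinned to one socket.

    Assumes contiguous core numbering per socket and one NUMA node per
    socket; these are assumptions, not guarantees.
    """
    if cores_per_socket < 1 or socket < 0:
        raise ValueError("cores_per_socket must be >= 1 and socket >= 0")
    first = socket * cores_per_socket
    last = first + cores_per_socket - 1
    env = {"OMP_NUM_THREADS": str(cores_per_socket)}
    command = f"numactl -C {first}-{last} -m {socket} python {script}"
    return env, command


env, command = single_socket_command(48)
print(env, command)
```

With 48 cores per socket this reproduces the command shown above.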
 #### 2.3 Sample Output
 #### [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)
 ```log
 Inference time: xxxx seconds
 -------------------- Output --------------------
 What is AI?
 AI is what we call "Artificial Intelligence."<eod>
 ```
--- a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2/generate.py
+++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2/generate.py
@ -0,0 +1,67 @@
 #
 # Copyright 2016 The BigDL Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
 import torch, transformers
 import sys, os, time
 import argparse
 from transformers import LlamaTokenizer
 from bigdl.llm.transformers import AutoModelForCausalLM
 # Refer to https://huggingface.co/IEITYuan/Yuan2-2B-hf#Usage
 YUAN2_PROMPT_FORMAT = """
 {prompt}
 """
 if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate text using Yuan2-2B model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="IEITYuan/Yuan2-2B-hf",
                        help='The huggingface repo id for the Yuan2 to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt for the model')
    parser.add_argument('--n-predict', type=int, default=100,
                        help='Number of tokens to generate')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    # Load tokenizer
    print("Creating tokenizer...")
    tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>',
                          '<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)
    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    print("Creating model...")
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cpu", trust_remote_code=True, load_in_4bit=True).eval()
    prompt = YUAN2_PROMPT_FORMAT.format(prompt=args.prompt)
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    # Measure the inference time
    start_time = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_length=args.n_predict)
    end_time = time.time()
    output_str = tokenizer.decode(outputs[0])
    print(f'Inference time: {end_time - start_time} seconds')
    print('-'*20, 'Output', '-'*20)
    print(output_str)
--- a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/config.json
+++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/config.json
@ -0,0 +1,39 @@
 {
    "_from_model_config":true,
  "architectures": [
    "YuanForCausalLM"
  ],
  "auto_map":{
          "AutoConfig":"configuration_yuan.YuanConfig",
          "AutoModelForCausalLM":"yuan_hf_model.YuanForCausalLM"
  },
  "tokenizer_class":"YuanTokenizer",
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 8192,
  "model_type": "yuan",
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "rms_norm_eps": 1e-06,
  "dropout": 0.1,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.30.0.dev0",
  "use_cache": true,
  "causal_mask": true,
  "use_flash_attention": false,
  "reset_attention_mask": true,
  "reset_position_ids": true,
  "use_loss_mask": false,
  "eod_token": 77185,
  "sep_token": 77187,
  "eod_token_id": 77185,
  "sep_token_id": 77185,
  "pad_token_id": 77185,
  "bos_token_id": 77185,
  "eos_token_id": 77185,
  "mask_token_id": 77185,
  "vocab_size": 135040
 }
--- a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/yuan_hf_model.py
+++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/yuan_hf_model.py
--- a/python/llm/example/CPU/PyTorch-Models/Model/yuan2/README.md
+++ b/python/llm/example/CPU/PyTorch-Models/Model/yuan2/README.md
@ -0,0 +1,61 @@
 # Yuan2
 In this directory, you will find examples on how you could apply BigDL-LLM `optimize_model` API to accelerate Yuan2 models. For illustration purposes, we utilize the [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf) as a reference Yuan2 model.
 ## 0. Requirements
 To run these examples with BigDL-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
 In addition, you need to modify some files in the Yuan2-2B-hf folder, because the Flash Attention dependency targets CUDA and currently cannot be installed on Intel CPUs. To turn it off manually, please refer to [this issue](https://github.com/IEIT-Yuan/Yuan-2.0/issues/92). We also provide two modified files ([config.json](yuan2-2B-instruct/config.json) and [yuan_hf_model.py](yuan2-2B-instruct/yuan_hf_model.py)) that can replace the original config.json and yuan_hf_model.py.
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a Yuan2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations.
 ### 1. Install
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 After installing conda, create a Python environment for BigDL-LLM:
 ```bash
 conda create -n llm python=3.9
 conda activate llm
 pip install --pre --upgrade bigdl-llm[all] # install the latest bigdl-llm nightly build with 'all' option
 pip install einops # additional package required for Yuan2 to conduct generation
 pip install pandas # additional package required for Yuan2 to conduct generation
 ```
 ### 2. Run
 ```
 python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
 ```
 Arguments info:
 - `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the Yuan2 model to download, or the path to a Hugging Face checkpoint folder. Defaults to `'IEITYuan/Yuan2-2B-hf'`.
 - `--prompt PROMPT`: the prompt to run inference on (with integrated prompt format for chat). Defaults to `'What is AI?'`.
 - `--n-predict N_PREDICT`: the max number of tokens to predict. Defaults to `100`.
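
One detail about `--n-predict` is worth a small worked example. The generate.py in this directory passes the value to `generate()` as `max_length`, and in Hugging Face Transformers `max_length` bounds the prompt tokens plus the generated tokens, whereas `max_new_tokens` bounds only the generated ones. The sketch below is plain arithmetic with a hypothetical helper name, not part of the example script.

```python
def new_token_budget(prompt_tokens, max_length):
    """Tokens left for generation when max_length counts the prompt too."""
    return max(max_length - prompt_tokens, 0)


# A 10-token prompt with --n-predict 100 leaves room for at most 90 new tokens.
short_prompt_budget = new_token_budget(10, 100)
# A prompt as long as max_length leaves no room for new tokens.
long_prompt_budget = new_token_budget(120, 100)
print(short_prompt_budget, long_prompt_budget)
```

If you need exactly N newly generated tokens regardless of prompt length, passing `max_new_tokens` to `generate()` instead is the usual choice.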
 #### 2.1 Client
 On a client Windows machine, it is recommended to run directly with full utilization of all cores:
 ```powershell
 python ./generate.py
 ```
 #### 2.2 Server
 For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and to run the example with all the physical cores of a single socket.
 E.g. on Linux,
 ```bash
 # set BigDL-LLM env variables
 source bigdl-llm-init
 # e.g. for a server with 48 cores per socket
 export OMP_NUM_THREADS=48
 numactl -C 0-47 -m 0 python ./generate.py
 ```
 #### 2.3 Sample Output
 #### [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)
 ```log
 Inference time: xxxx seconds
 -------------------- Output --------------------
 What is AI?
 The term "AI" refers to a process that involves creating machines or devices that can perform tasks that typically require human intelligence, such as AI-based decision-making and machine learning. AI is rapidly advancing in the fields of machine learning, computer science, and artificial intelligence, and has been used in various fields to achieve various goals, such as improving accuracy, efficiency, and complexity. However, the
 ```
--- a/python/llm/example/CPU/PyTorch-Models/Model/yuan2/generate.py
+++ b/python/llm/example/CPU/PyTorch-Models/Model/yuan2/generate.py
@ -0,0 +1,69 @@
 #
 # Copyright 2016 The BigDL Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
 import torch, transformers
 import sys, os, time
 import argparse
 from transformers import LlamaTokenizer, AutoModelForCausalLM
 from bigdl.llm import optimize_model
 # Refer to https://huggingface.co/IEITYuan/Yuan2-2B-hf#Usage
 YUAN2_PROMPT_FORMAT = """
 {prompt}
 """
 if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate text using Yuan2-2B model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="IEITYuan/Yuan2-2B-hf",
                        help='The huggingface repo id for the Yuan2 to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt for the model')
    parser.add_argument('--n-predict', type=int, default=100,
                        help='Number of tokens to generate')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    # Load tokenizer
    print("Creating tokenizer...")
    tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>',
                          '<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)
    # Load model
    print("Creating model...")
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cpu", trust_remote_code=True, torch_dtype=torch.float16).eval()
    # With only one line to enable BigDL-LLM optimization on model
    model = optimize_model(model)
    prompt = YUAN2_PROMPT_FORMAT.format(prompt=args.prompt)
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    # Measure the inference time
    start_time = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_length=args.n_predict)
    end_time = time.time()
    output_str = tokenizer.decode(outputs[0])
    print(f'Inference time: {end_time - start_time} seconds')
    print('-'*20, 'Output', '-'*20)
    print(output_str)
--- a/python/llm/example/CPU/PyTorch-Models/Model/yuan2/yuan2-2B-instruct/config.json
+++ b/python/llm/example/CPU/PyTorch-Models/Model/yuan2/yuan2-2B-instruct/config.json
@ -0,0 +1,39 @@
 {
    "_from_model_config":true,
  "architectures": [
    "YuanForCausalLM"
  ],
  "auto_map":{
          "AutoConfig":"configuration_yuan.YuanConfig",
          "AutoModelForCausalLM":"yuan_hf_model.YuanForCausalLM"
  },
  "tokenizer_class":"YuanTokenizer",
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 8192,
  "model_type": "yuan",
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "rms_norm_eps": 1e-06,
  "dropout": 0.1,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.30.0.dev0",
  "use_cache": true,
  "causal_mask": true,
  "use_flash_attention": false,
  "reset_attention_mask": true,
  "reset_position_ids": true,
  "use_loss_mask": false,
  "eod_token": 77185,
  "sep_token": 77187,
  "eod_token_id": 77185,
  "sep_token_id": 77185,
  "pad_token_id": 77185,
  "bos_token_id": 77185,
  "eos_token_id": 77185,
  "mask_token_id": 77185,
  "vocab_size": 135040
 }
--- a/python/llm/example/CPU/PyTorch-Models/Model/yuan2/yuan2-2B-instruct/yuan_hf_model.py
+++ b/python/llm/example/CPU/PyTorch-Models/Model/yuan2/yuan2-2B-instruct/yuan_hf_model.py
--- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/README.md
+++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/README.md
@ -0,0 +1,119 @@
 # Yuan2
 In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on Yuan2 models on [Intel GPUs](../README.md). For illustration purposes, we utilize the [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf) as a reference Yuan2 model.
 ## 0. Requirements
 To run these examples with BigDL-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
 In addition, you need to modify some files in the Yuan2-2B-hf folder, because the Flash Attention dependency targets CUDA and currently cannot be installed on Intel GPUs. To turn it off manually, please refer to [this issue](https://github.com/IEIT-Yuan/Yuan-2.0/issues/92). We also provide two modified files ([config.json](yuan2-2B-instruct/config.json) and [yuan_hf_model.py](yuan2-2B-instruct/yuan_hf_model.py)) that can replace the original config.json and yuan_hf_model.py.
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a Yuan2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
 #### 1.1 Installation on Linux
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.9
 conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 pip install einops # additional package required for Yuan2 to conduct generation
 pip install pandas # additional package required for Yuan2 to conduct generation
 ```
 #### 1.2 Installation on Windows
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.9 libuv
 conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 pip install einops # additional package required for Yuan2 to conduct generation
 ```
 ### 2. Configure oneAPI environment variables
 #### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 #### 2.2 Configurations for Windows
 ```cmd
 call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
 ```
 > Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
 ### 3. Runtime Configurations
 For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
 #### 3.1 Configurations for Linux
 <details>
 <summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
 </details>
 <details>
 <summary>For Intel Data Center GPU Max Series</summary>
 ```bash
 export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 export ENABLE_SDP_FUSION=1
 ```
 > Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
 </details>
 #### 3.2 Configurations for Windows
 <details>
 <summary>For Intel iGPU</summary>
 ```cmd
 set SYCL_CACHE_PERSISTENT=1
 set BIGDL_LLM_XMX_DISABLED=1
 ```
 </details>
 <details>
 <summary>For Intel Arc™ A300-Series or Pro A60</summary>
 ```cmd
 set SYCL_CACHE_PERSISTENT=1
 ```
 </details>
 <details>
 <summary>For other Intel dGPU Series</summary>
 There is no need to set further environment variables.
 </details>
 > Note: The first time each model runs on Intel iGPU, Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
 ### 4. Running examples
 ```bash
 python ./generate.py
 ```
 In the example, several arguments can be passed to satisfy your requirements:
 - `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the Yuan2 model to download, or the path to a Hugging Face checkpoint folder. Defaults to `'IEITYuan/Yuan2-2B-hf'`.
 - `--prompt PROMPT`: the prompt to run inference on (with integrated prompt format for chat). Defaults to `'What is AI?'`.
 - `--n-predict N_PREDICT`: the max number of tokens to predict. Defaults to `100`.
 #### Sample Output
 #### [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)
 ```log
 Inference time: xxxx seconds
 -------------------- Output --------------------
 What is AI?
 AI is a field of technology and technologies that is used to analyze and improve human behavior such as language processing, machine learning and artificial intelligence (AI).<eod>
 ```
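
The reported inference time excludes a warm-up call: the generate.py in this directory runs `generate()` once before starting the clock, because the first call on the GPU is slow. The pattern can be sketched generically as follows. `time_after_warmup` is a hypothetical helper, and the callable stands in for the real `model.generate(...)` call.

```python
import time


def time_after_warmup(fn, warmup_runs=1):
    """Call fn() warmup_runs times untimed, then time one more call.

    Returns (result, elapsed_seconds) for the timed call only.
    """
    for _ in range(warmup_runs):
        fn()
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the example this would wrap model.generate(...).
result, elapsed = time_after_warmup(lambda: sum(range(100_000)))
print(f"Inference time: {elapsed} seconds")
```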
--- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/generate.py
+++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/generate.py
@ -0,0 +1,78 @@
 #
 # Copyright 2016 The BigDL Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
 import torch, transformers
 import sys, os, time
 import intel_extension_for_pytorch as ipex
 import argparse
 from transformers import LlamaTokenizer
 from bigdl.llm.transformers import AutoModelForCausalLM
 # Refer to https://huggingface.co/IEITYuan/Yuan2-2B-hf#Usage
 YUAN2_PROMPT_FORMAT = """
 {prompt}
 """
 if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate text using Yuan2-2B model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="IEITYuan/Yuan2-2B-hf",
                        help='The huggingface repo id for the Yuan2 to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt for the model')
    parser.add_argument('--n-predict', type=int, default=100,
                        help='Number of tokens to generate')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    # Load tokenizer
    print("Creating tokenizer...")
    tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>',
                          '<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)
    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
    print("Creating model...")
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True).eval()
    # Convert the model to xpu
    model = model.to('xpu')
    prompt = YUAN2_PROMPT_FORMAT.format(prompt=args.prompt)
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    # Convert the inputs to xpu
    inputs = inputs.to('xpu')
    # Default warmup since the first generate() is slow
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_length=args.n_predict)
    print('Finish warmup')
    # Measure the inference time
    start_time = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_length=args.n_predict)
    end_time = time.time()
    output_str = tokenizer.decode(outputs[0])
    print(f'Inference time: {end_time - start_time} seconds')
    print('-'*20, 'Output', '-'*20)
    print(output_str)
--- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/config.json
+++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/config.json
@ -0,0 +1,39 @@
 {
    "_from_model_config":true,
  "architectures": [
    "YuanForCausalLM"
  ],
  "auto_map":{
          "AutoConfig":"configuration_yuan.YuanConfig",
          "AutoModelForCausalLM":"yuan_hf_model.YuanForCausalLM"
  },
  "tokenizer_class":"YuanTokenizer",
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 8192,
  "model_type": "yuan",
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "rms_norm_eps": 1e-06,
  "dropout": 0.1,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.30.0.dev0",
  "use_cache": true,
  "causal_mask": true,
  "use_flash_attention": false,
  "reset_attention_mask": true,
  "reset_position_ids": true,
  "use_loss_mask": false,
  "eod_token": 77185,
  "sep_token": 77187,
  "eod_token_id": 77185,
  "sep_token_id": 77185,
  "pad_token_id": 77185,
  "bos_token_id": 77185,
  "eos_token_id": 77185,
  "mask_token_id": 77185,
  "vocab_size": 135040
 }
--- a/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/yuan_hf_model.py
+++ b/python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2/yuan2-2B-instruct/yuan_hf_model.py
--- a/python/llm/example/GPU/PyTorch-Models/Model/yuan2/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/yuan2/README.md
@ -0,0 +1,122 @@
 # Yuan2
 In this directory, you will find examples on how you could apply BigDL-LLM `optimize_model` API to accelerate Yuan2 models on [Intel GPUs](../README.md). For illustration purposes, we utilize the [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf) as a reference Yuan2 model.
 ## 0. Requirements
 To run these examples with BigDL-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
 In addition, you need to modify some files in the Yuan2-2B-hf folder, because the Flash Attention dependency targets CUDA and currently cannot be installed on Intel GPUs. To turn it off manually, please refer to [this issue](https://github.com/IEIT-Yuan/Yuan-2.0/issues/92). We also provide two modified files ([config.json](yuan2-2B-instruct/config.json) and [yuan_hf_model.py](yuan2-2B-instruct/yuan_hf_model.py)) that can replace the original config.json and yuan_hf_model.py.
 ## Example: Predict Tokens using `generate()` API
 In the example [generate.py](./generate.py), we show a basic use case for a Yuan2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
 ### 1. Install
 #### 1.1 Installation on Linux
 We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
 After installing conda, create a Python environment for BigDL-LLM:
 ```bash
 conda create -n llm python=3.9
 conda activate llm
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu # install the latest bigdl-llm nightly build with the 'xpu' option
 pip install einops # additional package required for Yuan2 to conduct generation
 pip install pandas # additional package required for Yuan2 to conduct generation
 ```
 #### 1.2 Installation on Windows
 We suggest using conda to manage environment:
 ```bash
 conda create -n llm python=3.9 libuv
 conda activate llm
 # below command will install intel_extension_for_pytorch==2.1.10+xpu as default
 pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
 pip install einops # additional package required for Yuan2 to conduct generation
 ```
 ### 2. Configure oneAPI environment variables
 #### 2.1 Configurations for Linux
 ```bash
 source /opt/intel/oneapi/setvars.sh
 ```
 #### 2.2 Configurations for Windows
 ```cmd
 call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
 ```
 > Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
 ### 3. Runtime Configurations
 For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
 #### 3.1 Configurations for Linux
 <details>
 <summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
 For optimal performance on Arc, it is recommended to set several environment variables.
 ```bash
 export USE_XETLA=OFF
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 ```
 </details>
 <details>
 <summary>For Intel Data Center GPU Max Series</summary>
 ```bash
 export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
 export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
 export ENABLE_SDP_FUSION=1
 ```
 > Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
 </details>
 #### 3.2 Configurations for Windows
 <details>
 <summary>For Intel iGPU</summary>
 ```cmd
 set SYCL_CACHE_PERSISTENT=1
 set BIGDL_LLM_XMX_DISABLED=1
 ```
 </details>
 <details>
 <summary>For Intel Arc™ A300-Series or Pro A60</summary>
 ```cmd
 set SYCL_CACHE_PERSISTENT=1
 ```
 </details>
 <details>
 <summary>For other Intel dGPU Series</summary>
 There is no need to set further environment variables.
 </details>
 > Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
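The per-device settings above can be summarized as a small lookup table. The following is only a sketch: the dictionary keys and the `shell_lines` helper are hypothetical labels, and the values are copied from the sections above. `LD_PRELOAD` for the Max Series is left out because it depends on the existing value and on `CONDA_PREFIX`, so it is best set directly in the shell.

```python
# Recommended variables copied from the sections above.
# The (os, device) keys are hypothetical labels used only in this sketch.
RUNTIME_ENV = {
    ("linux", "arc_or_flex"): {
        "USE_XETLA": "OFF",
        "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS": "1",
    },
    ("linux", "max"): {
        # LD_PRELOAD is intentionally omitted; set it in the shell as shown above.
        "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS": "1",
        "ENABLE_SDP_FUSION": "1",
    },
    ("windows", "igpu"): {
        "SYCL_CACHE_PERSISTENT": "1",
        "BIGDL_LLM_XMX_DISABLED": "1",
    },
    ("windows", "arc_a300_or_pro_a60"): {"SYCL_CACHE_PERSISTENT": "1"},
    ("windows", "other_dgpu"): {},  # no further variables needed
}


def shell_lines(os_name, device):
    """Return the shell commands that set the variables for one (os, device) pair."""
    keyword = "export" if os_name == "linux" else "set"
    variables = RUNTIME_ENV[(os_name, device)]
    return [f"{keyword} {name}={value}" for name, value in variables.items()]


if __name__ == "__main__":
    for line in shell_lines("linux", "arc_or_flex"):
        print(line)
```

The helper only prints commands to paste into the shell; it does not modify the environment of the running process.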
 ### 4. Running examples
 ```bash
 python ./generate.py
 ```
 In the example, several arguments can be passed to satisfy your requirements:
 - `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the Yuan2 model to download, or the path to a local Hugging Face checkpoint folder. Defaults to `'IEITYuan/Yuan2-2B-hf'`.
 - `--prompt PROMPT`: the prompt to run inference on (with integrated prompt format for chat). Defaults to `'What is AI?'`.
 - `--n-predict N_PREDICT`: the maximum number of tokens to predict. Defaults to `100`.
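To illustrate how these flags are parsed, here is a small sketch that mirrors the argument definitions in `generate.py`. The `build_parser` function name and the sample prompt are hypothetical; note that `argparse` exposes dashed flags as attributes with underscores.

```python
import argparse


def build_parser():
    """Mirror of the three arguments accepted by generate.py (illustrative only)."""
    parser = argparse.ArgumentParser(description="Generate text using Yuan2-2B model")
    parser.add_argument("--repo-id-or-model-path", type=str, default="IEITYuan/Yuan2-2B-hf",
                        help="Hugging Face repo id or local checkpoint folder")
    parser.add_argument("--prompt", type=str, default="What is AI?",
                        help="Prompt for the model")
    parser.add_argument("--n-predict", type=int, default=100,
                        help="Maximum number of tokens to predict")
    return parser


if __name__ == "__main__":
    # Override the prompt and token budget; the model path keeps its default.
    args = build_parser().parse_args(["--prompt", "What is deep learning?", "--n-predict", "64"])
    print(args.repo_id_or_model_path, args.prompt, args.n_predict)
```

The equivalent command line would pass the same flags to `python ./generate.py`.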
 #### Sample Output
 #### [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)
 ```log
 Inference time: xxxx seconds
 -------------------- Output --------------------
 What is AI?
 AI is the process of creating machines that can interact with humans with their minds and learn and understand them. It enables us to think about ideas and ideas, and then we can analyze them and come up with new ideas. It's not so much that you need to be an AI as an individual, you can be an AI, just as you think.<sep> 人工智能（AI）是一种计算机程序，它可以帮助我们思考和学习，从而让我们更好地理解人类的
 ```
--- a/python/llm/example/GPU/PyTorch-Models/Model/yuan2/generate.py
+++ b/python/llm/example/GPU/PyTorch-Models/Model/yuan2/generate.py
@ -0,0 +1,80 @@
 #
 # Copyright 2016 The BigDL Authors.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
 import torch, transformers
 import sys, os, time
 import intel_extension_for_pytorch as ipex
 import argparse
 from transformers import LlamaTokenizer, AutoModelForCausalLM
 from bigdl.llm import optimize_model
 # Refer to https://huggingface.co/IEITYuan/Yuan2-2B-hf#Usage
 YUAN2_PROMPT_FORMAT = """
 {prompt}
 """
 if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate text using Yuan2-2B model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="IEITYuan/Yuan2-2B-hf",
                        help='The huggingface repo id for the Yuan2 to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt for the model')
    parser.add_argument('--n-predict', type=int, default=100,
                        help='Number of tokens to generate')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    # Load tokenizer
    print("Creating tokenizer...")
    tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>',
                          '<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)
    # Load model
    print("Creating model...")
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype='auto', low_cpu_mem_usage=True).eval()
    # With only one line to enable BigDL-LLM optimization on model
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
    model = optimize_model(model)
    # Convert the model to xpu
    model = model.to('xpu')
    prompt = YUAN2_PROMPT_FORMAT.format(prompt=args.prompt)
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    # Convert the inputs to xpu
    inputs = inputs.to('xpu')
    # Default warmup since the first generate() is slow
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_new_tokens=args.n_predict)
    print('Finish warmup')
    # Measure the inference time
    start_time = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
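    outputs = model.generate(inputs, do_sample=True, top_k=5, max_new_tokens=args.n_predict)
    # XPU kernels are launched asynchronously, so wait for them to finish before stopping the timer
    torch.xpu.synchronize()
    end_time = time.time()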
    output_str = tokenizer.decode(outputs[0])
    print(f'Inference time: {end_time - start_time} seconds')
    print('-'*20, 'Output', '-'*20)
    print(output_str)
--- a/python/llm/example/GPU/PyTorch-Models/Model/yuan2/yuan2-2B-instruct/config.json
+++ b/python/llm/example/GPU/PyTorch-Models/Model/yuan2/yuan2-2B-instruct/config.json
@ -0,0 +1,39 @@
 {
    "_from_model_config":true,
  "architectures": [
    "YuanForCausalLM"
  ],
  "auto_map":{
          "AutoConfig":"configuration_yuan.YuanConfig",
          "AutoModelForCausalLM":"yuan_hf_model.YuanForCausalLM"
  },
  "tokenizer_class":"YuanTokenizer",
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 8192,
  "model_type": "yuan",
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "rms_norm_eps": 1e-06,
  "dropout": 0.1,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.30.0.dev0",
  "use_cache": true,
  "causal_mask": true,
  "use_flash_attention": false,
  "reset_attention_mask": true,
  "reset_position_ids": true,
  "use_loss_mask": false,
  "eod_token": 77185,
  "sep_token": 77187,
  "eod_token_id": 77185,
  "sep_token_id": 77185,
  "pad_token_id": 77185,
  "bos_token_id": 77185,
  "eos_token_id": 77185,
  "mask_token_id": 77185,
  "vocab_size": 135040
 }
--- a/python/llm/example/GPU/PyTorch-Models/Model/yuan2/yuan2-2B-instruct/yuan_hf_model.py
+++ b/python/llm/example/GPU/PyTorch-Models/Model/yuan2/yuan2-2B-instruct/yuan_hf_model.py